In this paper, we propose a parameter space augmentation approach that
is based on "intentionally" introducing a pseudo-nuisance parameter into
generalized linear models for the purpose of variance reduction. We first consider
the parameter whose norm is equal to one. By introducing a pseudo-nuisance
parameter into models to be estimated, an extra estimation is asymptotically
normal and is, more importantly, non-positively correlated to the estimation
that asymptotically achieves the Fisher/quasi-Fisher information. As such,
the resulting estimation asymptotically has smaller variance-covariance
matrices than the Fisher/quasi-Fisher information. For general cases where
the norm of the parameter is not necessarily equal to one, two-stage
quasi-likelihood procedures separately estimating the scalar and direction of the
parameter are proposed. The traces of the limiting variance-covariance
matrices are in general smaller than or equal to that of the Fisher/quasi-Fisher
information. We also discuss the pros and cons of the new methodology, and
possible extensions. As this methodology of parameter space augmentation
is general, it may be readily extended to handle, say, cluster data and
correlated data, and other models.
1 Introduction



For parametric regression models, efficiency of estimation is used as a criterion for
estimation accuracy. It is a well-investigated issue, and the relevant theory is
well developed in the literature. For instance, for linear models with a fixed design,
the Gauss-Markov theorem shows that the ordinary least squares estimator is
the best linear unbiased estimator (BLUE, Markoff 1912). Plackett (1949) gave a
nice description of the contributions by Gauss, Laplace, Markov and others. Here
'best' means minimum variance. When the design is random, its limiting variance
achieves the Fisher information. However, unbiasedness of estimation is usually
not achievable, particularly for generalized linear models (GLMs); for asymptotically
unbiased estimation, the notion of 'best' is again explained by the Fisher
information.

The Fisher information is a way of measuring the amount of information that 
an observable random variable carries about an unknown parameter upon which 
the probability of the observable random variable depends. Fisher (1922, p. 316) 
presented the "Criterion of Efficiency" that refers to large-sample behavior: "when 
the distribution of the statistic tends to normality, that statistic is to be chosen which 
has the least probable error." "Efficiency" is efficiency in summarizing information 
in the large-sample case. 

The class of generalized linear models (GLMs) is one of the most widely used
classes of regression models for statistical analyses. The relevant theoretical
investigations are rather complete, and the results on parameter estimation have
become standard in the literature and textbooks. As is well known, the GLM has the
structure

$$h(\mu) = \beta^T X, \tag{1.1}$$

where $\mu = E(Y|X)$, $Y$ is the response, $X$ is the $p$-dimensional predictor vector,
$\beta$ is the parameter of interest, and $h$ is a given link function. It is also often
written as

$$Y = g(\beta^T X) + \varepsilon, \tag{1.2}$$

with $E(\varepsilon|X) = 0$ and a given function $g$, although (1.1) and (1.2) are not exactly
equivalent unless $g$ is invertible. Without loss of generality, assume that $E(X) = 0$
throughout the present paper.

Suppose that an independent identically distributed sample $\{(x_1, y_1), \ldots, (x_n, y_n)\}$
is available. To estimate the parameter $\beta$, there are several standard methods;
among them, weighted least squares and the quasi-likelihood proposed by
Wedderburn (1974) are the most popular when the distribution of the error
is not parametric. The estimation of $\beta$ via the quasi-likelihood can achieve
asymptotic efficiency: when the underlying distribution of the error is natural
exponential, the limiting variance-covariance matrix can be, or be proportional to, the
Fisher information that attains the Cramér-Rao lower bound (Cramér, 1946; Rao,
1945). In effect, it is asymptotically efficient among all consistent solutions to linear
estimating equations; see Wedderburn (1974), McCullagh (1983), Jarrett (1984),
McLeish (1984), Firth (1987) and Godambe and Heyde (1987). A relevant reference
is Godambe (1960), who proved the optimum property of the Fisher information in
likelihood estimation. When we impose no specific distributional conditions other
than some moments of the error, the limiting variance-covariance matrix
is called the quasi-Fisher information; see e.g. Wefelmeyer (1996). These results
have long been the benchmarks against which the estimation efficiency of any new
method is compared. Le Cam (1986) is a comprehensive reference book on rigorous
proofs of asymptotic optimality in this area. In recent years, some approaches have
been developed to improve generalized estimating equations by using quadratic inference
functions for cluster data; see Li and Lindsay (1996), Qu, Lindsay and Li (2000),
and Wang and Qu (2009).

Note that all existing estimation methods are within the parametric framework,
which is a natural and seemingly necessary way to treat a parametric problem.
As such, it seems that we need not bother to estimate anything other than the
parameter of interest if there are no nuisance parameters in the model. From a
heuristic perspective, in both practice and theory, doing so would create extra
estimation errors and make the estimation variation even larger.

However, our study shows that this is not always true. In this paper, we propose
a two-stage estimation method that estimates the scalar $\|\beta_0\|$ and the direction
$\beta_0/\|\beta_0\|$ separately to form a final estimator of $\beta_0$, where $\|\cdot\|$ is the
Euclidean norm. Although both stages are based on the quasi-likelihood, the estimation
of the direction $\beta_0/\|\beta_0\|$ uses a semiparametric quasi-likelihood obtained by
introducing a pseudo-nuisance parameter into the model to be estimated and regarding
the link function $g$ as its true value. In the next section, we give a detailed
motivation for doing this. The results are interesting: the traces of the limiting
variance-covariance matrices are smaller than or equal to those of the
Fisher/quasi-Fisher information, and in some cases, particularly for $\beta_0$ with norm
one, the matrices themselves are smaller than or equal to the corresponding
Fisher/quasi-Fisher information. Also, the reduction of variance is highly related to
the correlations among the predictors. We will see this in the theorems in Section 4.
Two relevant references on least squares are Wang, Xue, Zhu and Chong (2010) and
Chang, Xue and Zhu (2010). A special case is the linear model. Our result shows that,
although our estimator is only asymptotically unbiased, the trace of its
variance-covariance matrix is smaller than what the Gauss-Markov theorem provides in
the independent identically distributed case. In the next section, we shall present
the details of the estimation procedures and the results. Of course, as a trade-off,
to facilitate the use of nonparametric estimation, the technical conditions on the
predictor vector are much stronger than those the classical methods need.

It is worth pointing out that this parameter-space augmentation approach has
its roots in estimation problems with nuisance parameters. For example, Pierce
(1982) pointed out that when nuisance parameters in models are estimated, the
estimation of the parameter of interest may be more efficient than when the nuisance
parameters are regarded as given. The difference from his result is that, as a
methodology, we create a nuisance parameter to be estimated and regard the given link
as its true value. This methodology may be useful for other estimation problems,
which we discuss in Section 5.

The paper is organized as follows. In Section 2, we describe in detail the
motivation of our method. Section 3 presents the estimation procedures. Section 4
presents the results. As the results rely on a given variance function of the error
and, for technical purposes, on more regularity conditions than the classical
quasi-likelihood needs, we discuss in Section 5 the pros and cons of the new
methodology and some extensions. The proofs of the main results are relegated to
Section 6, the Appendix.



2 Motivation of methodology development 



2.1 A brief review of the quasi-likelihood 

For the model (1.2), suppose that a sample $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ is available. In
a simple case with a given variance function $v(X) = \mathrm{var}(Y|X)$, the quasi-likelihood
estimator $\hat\beta$ is the solution, over all $\beta$, of the equation $G(\beta) = 0$, where

$$G(\beta) = \sum_{i=1}^n (y_i - g(\beta^T x_i))\, g'(\beta^T x_i)\, x_i / v(x_i). \tag{2.1}$$

It is worth pointing out that for this $v(\cdot)$, the above is equivalent to
weighted least squares (WLS). When the variance can be written as $\sigma^2 v(X)$ where
$\sigma^2$ is an unknown dispersion parameter, we can still use $v(X)$ in the estimating
equation. However, when the variance function has the structure $v(\beta_0^T X)$ or
$v(g(\beta_0^T X))$, the method differs from WLS, and we use that function in $G(\beta)$ in
lieu of $v(X)$. In that case the proof is almost identical to the case with $v(X)$, and
the results for the limiting variance of the estimator $\hat\beta$ are almost unchanged,
even simpler in form. Further, when the function $v$ in $v(g(\beta_0^T X))$ is also unknown,
we need to replace it with a nonparametric estimator. The result is also similar.
The extensions will be discussed in Section 5. Thus, for simplicity of presentation,
we use $v(X)$ throughout this paper.
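
As a minimal sketch of how the estimating equation (2.1) can be solved numerically, the following Python code applies Fisher scoring for $p = 2$. The function names (`quasi_score`, `fit_quasi_likelihood`), the logistic choice of $g$, the choice $v = 1$ and the noise-free toy data are illustrative assumptions and are not taken from the paper.

```python
import math


def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def quasi_score(beta, xs, ys, g, dg, v):
    """Return G(beta) of (2.1) and sum_i g'(beta^T x_i)^2 x_i x_i^T / v(x_i)."""
    score = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(xs, ys):
        t = beta[0] * x[0] + beta[1] * x[1]
        w = dg(t) / v(x)
        for j in range(2):
            score[j] += (y - g(t)) * w * x[j]
            for k in range(2):
                info[j][k] += dg(t) * w * x[j] * x[k]
    return score, info


def fit_quasi_likelihood(xs, ys, g, dg, v, beta_init, n_iter=50, tol=1e-12):
    """Fisher scoring: beta <- beta + info^{-1} G(beta) until the step is tiny."""
    beta = list(beta_init)
    for _ in range(n_iter):
        score, info = quasi_score(beta, xs, ys, g, dg, v)
        step = solve_2x2(info, score)
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Toy check (assumed setup): logistic g, v = 1, noise-free responses,
# so the solution of G(beta) = 0 is the true parameter itself.
g = lambda t: 1.0 / (1.0 + math.exp(-t))
dg = lambda t: g(t) * (1.0 - g(t))
v = lambda x: 1.0
beta_true = [1.0, -0.5]
xs = [(a / 2.0, b / 2.0) for a in range(-2, 3) for b in range(-2, 3)]
ys = [g(beta_true[0] * x[0] + beta_true[1] * x[1]) for x in xs]
beta_hat = fit_quasi_likelihood(xs, ys, g, dg, v, beta_init=[0.0, 0.0])
```

With noisy responses the same iteration applies; only the toy data would change.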

From the projection theorem (Small and McLeish, 1994, p. 79), the
quasi-likelihood function $G(\beta)$ of (2.1) is the optimal linear combination of the
residuals $y_i - g(\beta^T x_i)$. The asymptotic normality of $\hat\beta$ can be derived by
using a Taylor expansion and Slutsky's theorem to obtain the asymptotic representation

$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i\, g'(\beta_0^T x_i)\, x_i / v(x_i)
= \frac{1}{n} \sum_{i=1}^n \frac{(g'(\beta_0^T x_i))^2}{v(x_i)}\, x_i x_i^T \sqrt{n}(\hat\beta - \beta_0) + o_p(1)
=: V \sqrt{n}(\hat\beta - \beta_0) + o_p(1), \tag{2.2}$$

with $V = E\big((g'(\beta_0^T X))^2 X X^T / v(X)\big)$, and then

$$\sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N(0, V^{-1}), \tag{2.3}$$

where "$\xrightarrow{d}$" stands for convergence in distribution.

The matrix $V^{-1}$ is called the quasi-Fisher information; it attains the minimum among
all estimators that use linear combinations of the residuals $y_i - g(\beta^T x_i)$.
The above deduction holds, of course, under certain regularity conditions.



2.2 Motivation: variance reduction by an extra "estimation"



To simplify the motivation for our estimation procedures, we make, for the time
being, four assumptions, of which the first three will be removed in the next
section. First, assume $v(\cdot) = 1$ for the moment. Second, we consider for the time
being the case where $\beta$ is of norm one: $\|\beta\| = 1$. This is because the following
arguments do not make sense for general $\beta$ due to an identifiability problem, which
is also the reason why we need the two-stage estimation described in the next
section. Third, to reparametrize $\beta$ when its norm is equal to one, we assume that
there is a parameter $\theta$ of dimension $p - 1$ such that the $p \times (p-1)$ Jacobian
matrix $J(\theta) = \partial\beta/\partial\theta$ exists. The details will be presented in the next
section as well. Fourth, the dimension of $\beta$ is greater than one; otherwise, our
approach cannot improve on the classical quasi-likelihood.



From the title of this subsection, it is natural to ask why and how
we consider introducing an extra "estimation" when we use the quasi-likelihood. As
is known, the quasi-likelihood is motivated by the likelihood with a given density
function of the error, or by weighted least squares when the distribution of the
error is unknown. For simplicity of illustration, we consider ordinary least
squares (OLS), which is equivalent to the likelihood when the error term follows the
standard normal distribution $N(0, 1)$.



2.2.1 Motivation from estimation criterion 

The idea of the OLS is to search for a function $g$ with $\beta$ so that $g(\beta^T X)$ fits
$Y$ best in the OLS criterion. Basic probability theory tells us that the best
$g(\beta^T X)$ is the conditional expectation of $Y$ given $\beta^T X$. Note that in the
GLM, $E(Y|\beta_0^T X) = g(\beta_0^T X)$ with respect to the joint distribution $F(x, y)$ of $(X, Y)$.
Thus, under the parametric framework, a natural choice is not to search for the function
$g$, but to use the given $g$ and to consider the class of $g(\beta^T X)$ over all $\beta$. The OLS
criterion is therefore defined as $M(\beta) = E(Y - g(\beta^T X))^2$ for any $\beta$. It is easy to
see that the true parameter $\beta_0$ is the minimizer of $M(\beta)$ over all $\beta$ because

$$E(Y - g(\beta^T X))^2 = E(\varepsilon^2) + E(g(\beta_0^T X) - g(\beta^T X))^2.$$

Setting the derivative of $M(\beta)$ with respect to $\theta$ to zero leads to the population
version of the estimating equation $G(\beta)$:

$$E\big(\varepsilon\, g'(\beta^T X) J(\theta)^T X\big) = E\big\{[g(\beta^T X) - g(\beta_0^T X)]\, g'(\beta^T X) J(\theta)^T X\big\}.$$

When the expectations on both sides are replaced by their empirical versions
based on the sample $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, we obtain the estimating
equation that is a special case of the quasi-likelihood. When $\beta$ is close to $\beta_0$ at
a certain rate, $\theta$ is close to the $\theta_0$ associated with $\beta_0$, and
$g'(\beta^T X) \approx g'(\beta_0^T X)$, a Taylor expansion yields

$$E\big[\varepsilon\, g'(\beta_0^T X) J(\theta_0)^T X\big]
\approx E\big((g'(\beta_0^T X))^2 J(\theta_0)^T X X^T J(\theta_0)\big)(\theta - \theta_0). \tag{2.4}$$

The limiting variance-covariance matrix of $\hat\theta - \theta_0$ would be $V^{-1}$ where

$$V = E\big((g'(\beta_0^T X))^2 J(\theta_0)^T X X^T J(\theta_0)\big).$$

Further, since $\hat\beta - \beta_0 \approx J(\theta_0)(\hat\theta - \theta_0)$, the limiting
variance-covariance matrix of $\hat\beta - \beta_0$ would be $J(\theta_0) V^{-1} J(\theta_0)^T$.

This method depends entirely on the parametric structure of the model under
study. As mentioned above, at the population level $\beta_0$ is the minimizer
(solution) of $M(\beta)$ ($G(\beta)$), and $g(\beta_0^T X)$ is the conditional expectation $E(Y|\beta_0^T X)$.
However, at the sample level, the minimizer (solution) $\hat\beta$ of the empirical version
of $M(\beta)$ ($G(\beta)$) is not exactly equal to the true value $\beta_0$. As such, the search
based on $M(\beta)$ is actually a directional search along the direction with the fixed $g$,
which is not necessarily equal to the conditional expectation of $Y$ given $\beta^T X$. As a
result, the related minimum may not attain the minimum that can be
obtained by searching in the larger space of functions $\{g(\beta^T \cdot)$: all $g$ and $\beta\}$.
That is,

$$E(Y - g(\beta^T X))^2 = E(Y - E(Y|\beta^T X))^2 + E(E(Y|\beta^T X) - g(\beta^T X))^2
\ge E(Y - E(Y|\beta^T X))^2 = \inf_g E(Y - g(\beta^T X))^2.$$

This shows that the residual $Y - g(\beta^T X) = \varepsilon - (g(\beta^T X) - g(\beta_0^T X))$ may have larger
variability than $Y - E(Y|\beta^T X) = \varepsilon - (E(Y|\beta^T X) - g(\beta_0^T X))$ when $\beta \ne \beta_0$. It
suggests that, in the estimating equation, using $E(Y|\beta^T X)$ may lead to an estimator
of $\beta_0$ with less variability, in terms of
$E(Y|\beta^T X) - g(\beta_0^T X) = E(Y|\beta^T X) - E(Y|\beta_0^T X)$, than using
$g(\beta^T X) - g(\beta_0^T X)$. A Taylor expansion at $\beta_0$ gives us the expectation that
$\hat\beta - \beta_0$ obtained from the former would have smaller variance than that obtained
from the latter. We can then consider the solution of

$$E\big\{(Y - E(Y|\beta^T X))\, g'(\beta^T X) J(\theta)^T X\big\} = 0.$$

Equivalently, writing $E(Y|\beta^T X) = E(\varepsilon|\beta^T X) + E(g(\beta_0^T X)|\beta^T X)$,

$$E\big\{\big(\varepsilon - (E(Y|\beta^T X) - g(\beta_0^T X))\big)\, g'(\beta^T X) J(\theta)^T X\big\}$$

$$= E\big\{\big((\varepsilon - E(\varepsilon|\beta^T X)) + (g(\beta_0^T X) - E(g(\beta_0^T X)|\beta^T X))\big)\, g'(\beta^T X) J(\theta)^T X\big\}$$

$$= E\big\{[E(Y|\beta^T X) - E(Y|\beta_0^T X)]\, g'(\beta^T X) J(\theta)^T X\big\}. \tag{2.5}$$

Here we use the derivative $g'$ of $g$ rather than the derivative $E'(Y|\beta^T X)$ of $E(Y|\beta^T X)$
only because we try to avoid too many unknowns to be estimated. Note that in this
representation we do not delete $E(\varepsilon|\beta_0^T X)$ and
$g(\beta_0^T X) - E(g(\beta_0^T X)|\beta_0^T X)$, although both of them are zero at the population
level. This is because they are not necessarily zero at the sample level, and they
play an important role in variance reduction. When $\beta$ is close to $\beta_0$ at a certain
rate, and $\theta$ is close to the $\theta_0$ associated with $\beta_0$, noting that conditional
expectation is a self-adjoint operator and $g'(\beta^T X) \approx g'(\beta_0^T X)$, a
Taylor expansion of (2.5) yields



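$$E\big[(\varepsilon - E(\varepsilon|\beta_0^T X))\, g'(\beta_0^T X) J(\theta_0)^T X\big]$$

$$= E\big\{\varepsilon\, g'(\beta_0^T X)\big[J(\theta_0)^T X - E(J(\theta_0)^T X|\beta_0^T X)\big]\big\}$$

$$\approx E\big((g'(\beta_0^T X))^2 J(\theta_0)^T X X^T J(\theta_0)\big)(\theta - \theta_0)$$

$$\quad - E\big((g(\beta_0^T X) - E(g(\beta_0^T X)|\beta^T X))\, g'(\beta_0^T X) J(\theta_0)^T X\big). \tag{2.6}$$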



In terms of (2.5) and (2.6), we can make a comparison with the classical least squares
or likelihood. Asymptotically, our approach introduces into the equation an extra
term $E\big((g(\beta_0^T X) - E(Y|\beta_0^T X))\, g'(\beta_0^T X) J(\theta_0)^T X\big)$ at the true value $\beta_0$.
This term does not come up in classical least squares because, by the model
structure, it equals zero at the population level, and thus, in the classical
framework, we do not bother to "estimate" it. More importantly, from (2.6), we can
easily see that $E\big((g(\beta_0^T X) - E(Y|\beta_0^T X))\, g'(\beta_0^T X) J(\theta_0)^T X\big)$ is
non-positively correlated with $E\big(\varepsilon\, g'(\beta_0^T X) J(\theta_0)^T X\big)$. The correlation is
actually between $E\big(\varepsilon\, g'(\beta_0^T X) J(\theta_0)^T X\big)$ and
$-E\big(E(\varepsilon|\beta_0^T X)\, g'(\beta_0^T X) J(\theta_0)^T X\big)$. This helps us attain smaller variance
at the sample level. The details are as follows.

2.2.2 The gain from estimating $E(Y|\beta^T X)$

In the above presentation, the conditional expectation function $E(Y|\beta^T X)$
is a nuisance parameter to be estimated. From the approximation (2.6), if we can
define an estimator $\hat E(Y|\beta^T X)$ of $E(Y|\beta^T X)$ that satisfies the following
requirements, an asymptotically more efficient estimator of $\beta_0$ can be expected. First,
the estimator makes the second term on the right-hand side of (2.6)
asymptotically negligible. Second, its derivative at $\beta_0^T X$, $\hat E'(Y|\beta_0^T X)$, converges
to $g'(\beta_0^T X)$ at a certain rate. Third, and more importantly, the estimator creates
an extra "estimation" $\hat E(Y|\beta_0^T X)$ of $g(\beta_0^T X)$ such that the weighted sum
of the $g(\beta_0^T x_i) - \hat E(Y|\beta_0^T x_i)$'s is non-positively correlated with the weighted sum
of the $\varepsilon_i$'s, for the purpose of variance reduction. From (2.5) and (2.6), we can
see that we do get an extra term involving $-\hat E(\varepsilon|\beta_0^T X)$. Specifically, such an
extra estimation is not negligible, that is,
$\hat E(Y|\beta_0^T X) - g(\beta_0^T X) = \hat E(\varepsilon|\beta_0^T X) + (\hat E(g(\beta_0^T X)|\beta_0^T X) - g(\beta_0^T X))$
is not negligible. Technically, we will prove that
$\hat E(g(\beta_0^T X)|\beta_0^T X) - g(\beta_0^T X)$ is negligible. Thus, the gain comes from
$\hat E(\varepsilon|\beta_0^T X)$. Without confusion, here $\hat E$ denotes an estimator of a conditional
mean or of an unconditional mean, depending on where it appears. Note that
$\hat E(\varepsilon g'(\beta_0^T X) J(\theta_0)^T X)$ is asymptotically normal and helps us achieve the
Fisher information. When
$\hat E\big((g(\beta_0^T X) - \hat E(Y|\beta_0^T X))\, g'(\beta_0^T X) J(\theta_0)^T X\big) \approx -\hat E\big(\hat E(\varepsilon|\beta_0^T X)\, g'(\beta_0^T X) J(\theta_0)^T X\big)$
is non-positively correlated with $\hat E(\varepsilon g'(\beta_0^T X) J(\theta_0)^T X)$, we have a chance to
obtain a smaller variance than the Fisher information. This is just the case with our
approach. The covariance between these two terms is asymptotically equal to

$$-\mathrm{Cov} := -E\big\{g'(\beta_0^T X)^2\, E(J(\theta_0)^T X|\beta_0^T X)\, E(X^T J(\theta_0)|\beta_0^T X)\big\}.$$

Regardless of some technical details of nonparametric estimation, this calculation
can be easily justified from (2.5) and (2.6). Interestingly, we can also derive that
$\hat E\big((g(\beta_0^T X) - \hat E(Y|\beta_0^T X))\, g'(\beta_0^T X) J(\theta_0)^T X\big)$ has a limiting
variance-covariance matrix equal to Cov. In other words, the joint distribution of

$$\Big(\hat E\big(\varepsilon\, g'(\beta_0^T X) J(\theta_0)^T X\big),\ \hat E\big((g(\beta_0^T X) - \hat E(Y|\beta_0^T X))\, g'(\beta_0^T X) J(\theta_0)^T X\big)\Big)$$

is asymptotically normal with the covariance matrix

$$\begin{pmatrix} E\big(g'(\beta_0^T X)^2 J(\theta_0)^T X X^T J(\theta_0)\big) & -\mathrm{Cov} \\ -\mathrm{Cov} & \mathrm{Cov} \end{pmatrix}.$$

Thus, the sum of these two terms is asymptotically normal with the
variance-covariance matrix $V + \mathrm{Cov} - 2\,\mathrm{Cov}$, that is,

$$Q = V - \mathrm{Cov}.$$

We can also obtain it from (2.6). Therefore, an extra estimation $\hat E(Y|\beta^T X)$ of
$E(Y|\beta^T X)$ automatically introduces an extra "estimation" $\hat E(Y|\beta_0^T X)$ of
$E(Y|\beta_0^T X) = g(\beta_0^T X)$ that helps us achieve the reduction of variance.

By comparing the two sides of (2.6), the limiting variance-covariance matrix of
$\hat\theta - \theta_0$ would be $V^{-1} Q V^{-1}$ where

$$Q = E\big[g'(\beta_0^T X)^2 J(\theta_0)^T (X - E(X|\beta_0^T X))(X^T - E(X^T|\beta_0^T X)) J(\theta_0)\big]$$

and, recalling its definition,

$$V = E\big(g'(\beta_0^T X)^2 J(\theta_0)^T X X^T J(\theta_0)\big).$$

It is clear that $V - Q = \mathrm{Cov}$ is a positive semidefinite matrix, and hence
$V^{-1} Q V^{-1}$ is smaller than or equal to $V^{-1}$. From this and
$\hat\beta - \beta_0 \approx J(\theta_0)(\hat\theta - \theta_0)$, the limiting variance-covariance matrix of
$\hat\beta - \beta_0$ would be $J(\theta_0) V^{-1} Q V^{-1} J(\theta_0)^T$.
This matrix is smaller than or equal to $J(\theta_0) V^{-1} J(\theta_0)^T$.
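
To make the comparison between $V^{-1}$ and $V^{-1} Q V^{-1}$ concrete, here is a small numerical sketch for $p = 2$, where $\theta$ is a scalar and the matrices reduce to numbers. It assumes the linear model ($g(t) = t$, so $g' = 1$), $v = 1$, unit error variance, $\beta = (\cos\theta, \sin\theta)^T$, and Gaussian predictors $X \sim N(0, \Sigma)$ with unit variances and correlation `rho`, so that $E(X|\beta^T X) = \Sigma\beta\,\beta^T X / (\beta^T \Sigma \beta)$. The function name `variance_factors` and all numerical values are illustrative assumptions.

```python
import math


def variance_factors(theta, rho):
    """Return (V, Cov, Q) for p = 2 under the assumptions stated above."""
    beta = (math.cos(theta), math.sin(theta))    # unit-norm parameter
    jac = (-math.sin(theta), math.cos(theta))    # J(theta) = d beta / d theta
    sigma = ((1.0, rho), (rho, 1.0))             # covariance matrix of X
    sigma_beta = [sigma[i][0] * beta[0] + sigma[i][1] * beta[1] for i in range(2)]
    sigma_jac = [sigma[i][0] * jac[0] + sigma[i][1] * jac[1] for i in range(2)]
    v = jac[0] * sigma_jac[0] + jac[1] * sigma_jac[1]           # J^T Sigma J
    b_s_b = beta[0] * sigma_beta[0] + beta[1] * sigma_beta[1]   # beta^T Sigma beta
    j_s_b = jac[0] * sigma_beta[0] + jac[1] * sigma_beta[1]     # J^T Sigma beta
    cov = j_s_b ** 2 / b_s_b    # J^T E[E(X|beta^T X) E(X^T|beta^T X)] J
    return v, cov, v - cov


for rho in (0.0, 0.3, 0.6):
    v, cov, q = variance_factors(theta=0.3, rho=rho)
    classical = 1.0 / v        # V^{-1}
    augmented = q / v ** 2     # V^{-1} Q V^{-1}
    print(f"rho={rho:.1f}  classical={classical:.4f}  augmented={augmented:.4f}")
```

In this toy setting Cov vanishes when `rho` is zero, because $J(\theta)^T \beta = 0$, so the two variances coincide. A gain appears only when the predictors are correlated.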

The above "super-efficiency" also has its roots in estimation
problems with nuisance parameters, as we mentioned in Section 1. For instance,
suppose that we have a model with two sets of parameters: $\beta_0$, the parameters
of interest; and $\lambda$, nuisance parameters. Pierce (1982) found that an estimation
with estimated $\lambda$ can sometimes be more efficient than an "estimation" with the
true value of $\lambda$. In our setting, we regard the conditional expectation $E(Y|\cdot)$ as a
nuisance parameter, which is actually a pseudo-nuisance parameter for the models, and
the link function $g(\cdot)$ is regarded as its true value. In the classical methodologies,
the true value is used in estimation, whereas in our method an estimator of $E(Y|\cdot)$
is plugged in.



3 Two-stage estimation procedures 



From the above explanations about the least squares and likelihood, we can see the
intrinsic difference of our approach from the classical estimations. This motivates
us to use the following quasi-likelihood for the purpose of variance reduction:

$$SG(\beta) = \sum_{i=1}^n (y_i - \hat E(Y|\beta^T x_i))\, g'(\beta^T x_i)\, x_i / v(x_i)\, I_n(x_i), \tag{3.1}$$

where $\hat E(Y|\beta^T X)$ is an estimator of $E(Y|\beta^T X)$,
$I_n(x_i) = I\{\hat f_{\hat\gamma}(\hat\gamma^T x_i) > c_0\}$ for some positive value $c_0$, and $\hat f$ and
$\hat\gamma$ are, respectively, the estimators given in (3.13) of the density function
of $\beta_0^T X$ and of $\beta_0$, with $\beta_0$ replaced by its direction $\gamma_0$.
The details are given in Section 3.3 below. The truncation $I_n(x_i)$ is employed
for technical purposes, to handle the boundary points, particularly when a local
smoothing method is used to estimate the nonparametric regression function
$E(Y|\beta^T X)$; see Xia and Härdle (2006).

However, the above arguments are only feasible when the four assumptions stated
at the beginning of Section 2.2 are satisfied. The constraint $\|\beta_0\| = 1$
is intrinsic to our approach because, for general $\beta$, $E(Y|\beta^T X) = E(Y|c\beta^T X)$ for
any nonzero constant $c$, and then the uniqueness of the solution $\beta$ cannot be
guaranteed. To deal with this identifiability problem, we propose the following
two-stage estimation procedures.



3.1 Re-parametrization of the parameter 

We note that although $\beta_0$ is not identifiable in the above procedure, its direction
$\beta_0/\|\beta_0\|$ is when we apply the idea described in the previous section.
As such, we can estimate either the direction or the scale $\|\beta_0\|$ first. Write
$\alpha_0 = \|\beta_0\|$ and $\gamma_0 = \beta_0/\|\beta_0\|$. It is easy to see that $\gamma_0$ is identifiable
when we rewrite the original model as $Y = E(Y|\gamma_0^T X) + \varepsilon$. We can estimate $\gamma_0$ by
using an estimating equation similar to (3.1), and $\alpha_0$ can be estimated by the
classical quasi-likelihood. However, $\gamma_0$ lies on the surface of the unit sphere,
so we cannot directly derive the scores in the quasi-likelihood. To
deal with this, a re-parametrization is necessary. A popular re-parametrization is
the "remove-one-component" method as in Yu and Ruppert (2002), and then
Wang, Xue, Zhu and Chong (2010) and Chang, Xue and Zhu (2010). Without loss
of generality, we may assume that the true parameter $\gamma_0$ has a positive component,
say $\gamma_{0r} > 0$ for $\gamma_0 = (\gamma_{01}, \ldots, \gamma_{0p})^T$ and some $1 \le r \le p$. For
$\gamma = (\gamma_1, \ldots, \gamma_p)^T$, let $\gamma^{(r)} = (\gamma_1, \ldots, \gamma_{r-1}, \gamma_{r+1}, \ldots, \gamma_p)^T$ be the
$(p-1)$-dimensional parameter vector obtained by removing the $r$th component $\gamma_r$ of
$\gamma$. We may write

$$\gamma = \gamma(\gamma^{(r)}) = \big(\gamma_1, \ldots, \gamma_{r-1}, (1 - \|\gamma^{(r)}\|^2)^{1/2}, \gamma_{r+1}, \ldots, \gamma_p\big)^T. \tag{3.2}$$

The true parameter $\gamma_0^{(r)}$ must satisfy the constraint $\|\gamma_0^{(r)}\| < 1$. Thus, $\gamma$ is
infinitely differentiable in a neighborhood of $\gamma_0^{(r)}$, and the Jacobian matrix is

$$J(\gamma^{(r)}) = \frac{\partial\gamma}{\partial\gamma^{(r)}} = (\delta_1, \ldots, \delta_p)^T, \tag{3.3}$$
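where $\delta_s$ $(1 \le s \le p)$ satisfy $\delta_s = e_s$ for $1 \le s < r$,
$\delta_r = -(1 - \|\gamma^{(r)}\|^2)^{-1/2}\gamma^{(r)}$, and $\delta_s = e_{s-1}$ for $r + 1 \le s \le p$. Here
$e_s$ is the $(p-1)$-dimensional unit vector with $s$th component 1. For this
re-parametrization, we can also prove that $J(\gamma^{(r)})$ is orthogonal to $\gamma$, that is,
$J(\gamma^{(r)})^T \gamma = 0$. However, the columns of $J(\gamma^{(r)})$ are not orthogonal to each
other. Let $A(\gamma^{(r)}) = (\gamma, J(\gamma^{(r)}))$. We have

$$A(\gamma^{(r)})^T A(\gamma^{(r)}) = \begin{pmatrix} 1 & 0^T \\ 0 & J(\gamma^{(r)})^T J(\gamma^{(r)}) \end{pmatrix},$$

although it is not a diagonal matrix. Also, we can easily prove that $A(\gamma^{(r)})$ is
invertible, and its inverse is

$$A(\gamma^{(r)})^{-1} = \big(\gamma,\ J(\gamma^{(r)})(J(\gamma^{(r)})^T J(\gamma^{(r)}))^{-1}\big)^T. \tag{3.4}$$

An alternative is re-parametrization by the polar coordinate system. To simplify
the presentation, we assume with no loss of generality that the $p$-th component of
$\gamma$ is positive. That is, letting $\theta = (\theta_1, \ldots, \theta_{p-1})^T$, whose domain is the
$(p-1)$-dimensional set $(0, \pi)^{p-2} \times (0, \pi/2)$,

$$\gamma = \gamma(\theta) = \Big(\prod_{i=1}^{p-1}\cos(\theta_i),\ \sin(\theta_1)\prod_{i=2}^{p-1}\cos(\theta_i),\ \sin(\theta_2)\prod_{i=3}^{p-1}\cos(\theta_i),\ \ldots,\ \sin(\theta_{p-1})\Big)^T, \tag{3.5}$$

where $\prod_{i=p}^{p-1}\cos(\theta_i)$ is defined as 1. Otherwise, $\sin(\theta_{p-1})$ will be the $r$-th
component that is positive. The Jacobian matrix $J(\theta) = \partial\gamma(\theta)/\partial\theta$ has a special
structure: the $i$-th column is the derivative of $\gamma(\theta)$ with respect to $\theta_i$; in the
first $i + 1$ components, $\sin(\theta_i)$ and $\cos(\theta_i)$ are respectively replaced by
$\cos(\theta_i)$ and $-\sin(\theta_i)$, and the other components in the column are equal to zero
because the corresponding components of $\gamma(\theta)$ do not involve $\theta_i$. It is easy to
prove that all the columns are orthogonal to each other. Furthermore, $J(\theta)$ is
orthogonal to $\gamma(\theta)$. A brief justification can be given by induction. When
$p = 2$, the conclusion is clearly true. Assume that the conclusion is true when
$p = m - 1$. When $p = m$, $\gamma_m = (\cos(\theta_m)\gamma_{m-1}^T, \sin(\theta_m))^T$. For the derivative
with respect to $\theta_i$ with $i \le m - 1$,

$$\partial\gamma_m/\partial\theta_i = \big(\cos(\theta_m)\,\partial\gamma_{m-1}^T/\partial\theta_i,\ 0\big)^T.$$

Then for any $1 \le j < i$,

$$(\partial\gamma_m^T/\partial\theta_j)(\partial\gamma_m/\partial\theta_i) = \cos^2(\theta_m)\,(\partial\gamma_{m-1}^T/\partial\theta_j)(\partial\gamma_{m-1}/\partial\theta_i) = 0.$$

When $i = m$, $\partial\gamma_m/\partial\theta_m = (-\sin(\theta_m)\gamma_{m-1}^T, \cos(\theta_m))^T$, and we have for any $j < m$

$$(\partial\gamma_m^T/\partial\theta_j)(\partial\gamma_m/\partial\theta_m)
= \big(\cos(\theta_m)\,\partial\gamma_{m-1}^T/\partial\theta_j,\ 0\big)\big(-\sin(\theta_m)\gamma_{m-1}^T,\ \cos(\theta_m)\big)^T
= -\cos(\theta_m)\sin(\theta_m)\,(\partial\gamma_{m-1}^T/\partial\theta_j)\gamma_{m-1} = 0.$$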
But $A(\theta) := (\gamma(\theta), J(\theta))$ is not an orthogonal matrix, although
$J(\theta)^T J(\theta)$ is a diagonal matrix, because not all of its diagonal elements are
equal to 1. Also, we can easily prove that



where $A^{-}$ is the Moore-Penrose inverse of the matrix $A$ in case $A$ is not invertible. This
will be useful in the later analysis.

We can see that the two re-parametrizations have some similar properties. The
former is easier to compute and $A(\gamma^{(r)})$ is always invertible, whereas the latter
has a nicer orthogonality structure. Throughout the rest of the paper, we
assume with no loss of generality that $A(\theta)$ is invertible, where $\theta$ can be
either $\gamma^{(r)}$ in the former or $\theta$ in the latter.
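
The two re-parametrizations can be checked numerically with the following sketch. It implements the maps (3.2) and (3.5) and verifies the stated orthogonality properties for $p = 3$. The function names are hypothetical, the index `r` is 0-based in the code, and the polar Jacobian is approximated by central finite differences rather than computed analytically; these are choices made for this illustration only.

```python
import math


def gamma_remove_one(gamma_r, r):
    """Map (3.2): rebuild the unit vector gamma from gamma^(r); r is 0-based."""
    root = math.sqrt(1.0 - sum(c * c for c in gamma_r))
    return gamma_r[:r] + [root] + gamma_r[r:]


def jacobian_remove_one(gamma_r, r):
    """Jacobian (3.3) as a list of p rows delta_s^T, each of length p - 1."""
    q = len(gamma_r)
    root = math.sqrt(1.0 - sum(c * c for c in gamma_r))
    rows = []
    for s in range(q + 1):
        if s == r:
            rows.append([-c / root for c in gamma_r])
        else:
            k = s if s < r else s - 1
            rows.append([1.0 if j == k else 0.0 for j in range(q)])
    return rows


def gamma_polar(theta):
    """Map (3.5): polar-coordinate parametrization of the unit vector."""
    comps = [math.prod(math.cos(t) for t in theta)]
    for k in range(len(theta)):
        tail = math.prod(math.cos(t) for t in theta[k + 1:])
        comps.append(math.sin(theta[k]) * tail)
    return comps


def jacobian_polar(theta, eps=1e-6):
    """Central-difference approximation of d gamma / d theta, as p rows."""
    cols = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += eps
        dn[i] -= eps
        cols.append([(a - b) / (2.0 * eps)
                     for a, b in zip(gamma_polar(up), gamma_polar(dn))])
    return [list(row) for row in zip(*cols)]


def jt_times(jac, vec):
    """Compute J^T vec for a Jacobian stored as a list of rows."""
    return [sum(row[a] * x for row, x in zip(jac, vec)) for a in range(len(jac[0]))]


def gram(jac):
    """Compute J^T J for a Jacobian stored as a list of rows."""
    q = len(jac[0])
    return [[sum(row[a] * row[b] for row in jac) for b in range(q)] for a in range(q)]


# Remove-one-component: J^T gamma = 0, but the columns of J are not orthogonal.
g_free, r = [0.3, -0.4], 1
gam1 = gamma_remove_one(g_free, r)
jac1 = jacobian_remove_one(g_free, r)
# Polar coordinates: J^T gamma = 0 and J^T J is diagonal.
theta = [0.4, 0.7]
gam2 = gamma_polar(theta)
jac2 = jacobian_polar(theta)
```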



3.2 Estimation procedures 

Now we are in a position to present the estimation procedures. Note that
$E(Y|\gamma_0^T X) = g(\alpha_0 \gamma_0^T X)$. The derivative of $g(\alpha\gamma(\theta)^T X)$ with respect to
$\theta$ is $g'(\alpha\gamma(\theta)^T X)\,\alpha J(\theta)^T X$. Further, as the estimation involves
nonparametric smoothing, the boundary effect needs to be dealt with; in the
estimating equations of the following procedures we therefore trim off some
boundary points.

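Procedure 1.

Step 1. Obtain an initial estimator $\hat\beta_I$ of $\beta_0$ by the estimating equation (2.1), and
let $\hat\gamma_I = \hat\beta_I/\|\hat\beta_I\|$. Check the values of all components of $\hat\gamma_I$ and select an $r$
with $1 \le r \le p$ such that $|\hat\gamma_{I,r}| = \max_{1 \le k \le p}|\hat\gamma_{I,k}|$. As $\hat\gamma_{I,r}$ is not
necessarily positive, we define $\hat\alpha_I = \mathrm{sign}(\hat\gamma_{I,r})\|\hat\beta_I\|$. As such, we can
define the domain $\Gamma(\theta)$ of $\gamma(\theta)$ with the constraint that the $r$th component of
$\gamma(\theta)$ is positive.
Step 2. Estimate $\theta_0$ by the solution $\hat\theta$ of

$$SG(\theta) = \sum_{i=1}^n (y_i - \hat E(Y|\gamma(\theta)^T x_i))\, g'(\hat\alpha_I\gamma(\theta)^T x_i)\, J(\theta)^T x_i / v(x_i)\, I_n(x_i) = 0 \tag{3.7}$$

over the domain $\Gamma(\theta)$, where $I_n(x_i)$ is the truncation function in (3.1).
Step 3. Obtain the final estimator of $\beta_0$ as $\hat\beta_F = \hat\alpha_I\gamma(\hat\theta)$.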

Another procedure uses an iterative algorithm with one more step that benefits
from the variance-reduced estimation of $\gamma$ above.

Procedure 2.

Step 1. Obtain an initial estimator $\hat\beta_I$ of $\beta_0$ by the estimating equation (2.1),
and let $\hat\gamma_I = \hat\beta_I/\|\hat\beta_I\|$. Define $\hat\alpha_I = \|\hat\beta_I\|$.
Step 2. Estimate $\theta_0$ by the solution $\hat\theta$ of

$$SG(\theta) = \sum_{i=1}^n (y_i - \hat E(Y|\gamma(\theta)^T x_i))\, g'(\hat\alpha_I\gamma(\theta)^T x_i)\, J(\theta)^T x_i / v(x_i)\, I_n(x_i) = 0 \tag{3.8}$$

over all $\theta$, where $I_n(x_i)$ is the truncation function in (3.1).
Step 3. Estimate $\alpha_0$ by the solution $\hat\alpha_F$ of

$$G(\alpha) = \sum_{i=1}^n (y_i - g(\alpha\gamma(\hat\theta)^T x_i))\, g'(\alpha\gamma(\hat\theta)^T x_i)\, \gamma(\hat\theta)^T x_i / v(x_i)\, I_n(x_i) = 0. \tag{3.9}$$

Step 4. Obtain the final estimator of $\beta_0$ as $\hat\beta_F = \hat\alpha_F\gamma(\hat\theta)$.
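
The steps of Procedure 2 can be sketched in code under strong simplifying assumptions: $p = 2$ with the polar parametrization $\gamma(\theta) = (\cos\theta, \sin\theta)^T$, the identity link $g(t) = t$ (so $g' = 1$), $v = 1$, no trimming ($I_n = 1$), a Gaussian kernel with bandwidth $n^{-1/5}$, and a grid search near the initial angle in place of an exact root of $SG(\theta) = 0$. All function names, the simulated data and these choices are illustrative and are not the authors' implementation.

```python
import math
import random


def local_linear(z0, zs, ys, h):
    """Local linear estimate of E(Y | Z = z0) with a Gaussian kernel."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for z, y in zip(zs, ys):
        d = z - z0
        k = math.exp(-0.5 * (d / h) ** 2)
        s0 += k
        s1 += k * d
        s2 += k * d * d
        t0 += k * y
        t1 += k * d * y
    return (s2 * t0 - s1 * t1) / (s2 * s0 - s1 * s1)


def sg(theta, xs, ys, h):
    """Estimating function (3.8) with g(t) = t, v = 1 and I_n = 1."""
    gam = (math.cos(theta), math.sin(theta))
    jac = (-math.sin(theta), math.cos(theta))
    zs = [gam[0] * x[0] + gam[1] * x[1] for x in xs]
    total = 0.0
    for x, y, z in zip(xs, ys, zs):
        fitted = local_linear(z, zs, ys, h)
        total += (y - fitted) * (jac[0] * x[0] + jac[1] * x[1])
    return total


def two_stage(xs, ys, h, half_width=0.3, n_grid=31):
    # Step 1: initial estimate by least squares (equation (2.1) for g(t) = t).
    sxx = [[sum(x[j] * x[k] for x in xs) for k in range(2)] for j in range(2)]
    sxy = [sum(x[j] * y for x, y in zip(xs, ys)) for j in range(2)]
    det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
    beta_i = [(sxy[0] * sxx[1][1] - sxx[0][1] * sxy[1]) / det,
              (sxx[0][0] * sxy[1] - sxx[1][0] * sxy[0]) / det]
    theta_i = math.atan2(beta_i[1], beta_i[0])
    # Step 2: approximate root of SG(theta) = 0 on a grid around theta_I.
    grid = [theta_i + half_width * (2.0 * k / (n_grid - 1) - 1.0)
            for k in range(n_grid)]
    theta_hat = min(grid, key=lambda t: abs(sg(t, xs, ys, h)))
    gam = (math.cos(theta_hat), math.sin(theta_hat))
    # Step 3: re-estimate the scale alpha from (3.9), which is linear here.
    zs = [gam[0] * x[0] + gam[1] * x[1] for x in xs]
    alpha_f = sum(z * y for z, y in zip(zs, ys)) / sum(z * z for z in zs)
    # Step 4: combine scale and direction.
    return [alpha_f * gam[0], alpha_f * gam[1]]


# Simulated data (assumed setup): correlated Gaussian predictors, linear model.
random.seed(0)
beta_true = [0.8, 0.6]
n = 150
xs, ys = [], []
for _ in range(n):
    u, w = random.gauss(0, 1), random.gauss(0, 1)
    x = (u, 0.6 * u + 0.8 * w)
    xs.append(x)
    ys.append(beta_true[0] * x[0] + beta_true[1] * x[1] + 0.3 * random.gauss(0, 1))
beta_f = two_stage(xs, ys, h=n ** (-0.2))
```

The grid search keeps the sketch short; in practice a one-dimensional root finder started at the initial angle would be a natural replacement.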

Remark 3.1. The two procedures differ mainly in Step 1 and Step 3. When
we assume that a component of $\gamma$ is positive, the scalar $\alpha$ is not necessarily
positive. To avoid this sign identification issue, we define $\hat\alpha_I$ with the sign
function in Step 1 of Procedure 1. This is not needed for Procedure 2 because, after
obtaining an estimator of $\gamma$, we re-estimate the scalar $\alpha$, which adaptively
determines the sign of $\alpha$ in the final estimator.

Remark 3.2. Trimming off boundary points is a typical technique in nonparametric
estimation when local smoothing is applied. Here we simply use $I_n(x_i)$. Similar
trimming can be found in Xia and Härdle (2006) and Xia, Härdle and Linton
(2009), who used a smooth function instead of the indicator function. To obtain
results in which no such function appears in the limiting variance, that is, $I_n(x_i)$
goes to 1, we should let $c_0$ tend to zero as $n$ goes to infinity. This is achievable as
long as $c_0$ tends to zero at a rate that is not too fast; see Xia and Härdle (2006) and
Xia, Härdle and Linton (2009) for more details. Thus, we will not discuss the choice of
$c_0$ in detail.



3.3 Plug-in estimation of $E(Y|\gamma^T X)$

In the estimation procedures above, we need a plug-in estimator of $E(Y|\gamma^T X)$.
Here we consider the local linear smoother. Suppose that $(x_i, y_i)$, $1 \le i \le n$, are
independent identically distributed (iid) from model (1.2). To estimate
$E(Y|\gamma^T X)$ nonparametrically, the local linear smoother with the linear weight functions
$\{W_{ni}(z; \gamma, h);\ 1 \le i \le n\}$ is defined as follows. The local linear estimators
of $E(Y|\gamma^T X = z)$ and its derivative are the pair $(a, b)$ minimizing the weighted sum of
squares (see Fan and Gijbels 1996)

$$\sum_{i=1}^n \big[y_i - a - b(x_i^T\gamma - z)\big]^2 K_h(x_i^T\gamma - z), \tag{3.10}$$

where $K_h(\cdot) = K(\cdot/h)/h$ with $K$ being a symmetric kernel function on the real line
and $h = h_n$ being a bandwidth.

Without confusion, we use the notation $\hat g(z; \gamma, h) := \hat E(Y|\gamma^T X = z)$;
$\hat g(z; \gamma, h) = \hat a$ is the resulting estimator. Via a simple calculation, we have

$$\hat g(z; \gamma, h) = \sum_{i=1}^n W_{ni}(z; \gamma, h)\, y_i, \tag{3.11}$$

where, for $1 \le i \le n$,

$$W_{ni}(z; \gamma, h) = \frac{U_{ni}(z; \gamma, h)}{\sum_{j=1}^n U_{nj}(z; \gamma, h)}, \tag{3.12}$$

with

$$U_{ni}(z; \gamma, h) = \big[S_{n,2}(z; \gamma, h) - (x_i^T\gamma - z) S_{n,1}(z; \gamma, h)\big] K_h(x_i^T\gamma - z),$$

$$S_{n,r}(z; \gamma, h) = \frac{1}{n}\sum_{i=1}^n (x_i^T\gamma - z)^r K_h(x_i^T\gamma - z), \quad r = 0, 1, 2.$$

For the density function $f_\gamma(\gamma^T X)$, we can use the following estimator

$$\hat f_{\hat\gamma}(\hat\gamma^T x_i) = (nh^2)^{-1}\sum_{j=1}^n U_{nj}(\hat\gamma^T x_i; \hat\gamma, h), \tag{3.13}$$

and $\hat\gamma = \hat\beta_I/\|\hat\beta_I\|$.
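
The weights in (3.11) and (3.12) and the estimator (3.13) translate almost directly into code. The sketch below assumes a Gaussian kernel (one possible symmetric kernel) and takes the projected values $z_i = x_i^T\gamma$ as input; the function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the symmetric kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def u_weights(z, zs, h):
    """Unnormalized weights U_ni(z; gamma, h), with zs[i] = x_i^T gamma."""
    n = len(zs)
    kh = [gaussian_kernel((zi - z) / h) / h for zi in zs]
    s1 = sum((zi - z) * k for zi, k in zip(zs, kh)) / n
    s2 = sum((zi - z) ** 2 * k for zi, k in zip(zs, kh)) / n
    return [(s2 - (zi - z) * s1) * k for zi, k in zip(zs, kh)]


def local_linear_fit(z, zs, ys, h):
    """g_hat(z; gamma, h) = sum_i W_ni y_i as in (3.11) and (3.12)."""
    u = u_weights(z, zs, h)
    total = sum(u)
    return sum(ui * yi for ui, yi in zip(u, ys)) / total


def density_estimate(z, zs, h):
    """Estimator (3.13): (n h^2)^{-1} sum_j U_nj(z; gamma, h)."""
    return sum(u_weights(z, zs, h)) / (len(zs) * h * h)


# Sanity check: the local linear smoother reproduces a linear function exactly,
# because sum_i U_ni (z_i - z) = 0 by construction.
zs = [0.1 * i for i in range(-20, 21)]
ys = [2.0 + 3.0 * z for z in zs]
fit = local_linear_fit(0.37, zs, ys, h=0.5)
dens = density_estimate(0.37, zs, h=0.5)
```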

4 Main results 



By the classical quasi-likelihood for the GLM, we can obtain an initial estimator
$\hat\beta_I$ of $\beta_0$ that is root-$n$ consistent. This implies that $\hat\gamma_I$ is also root-$n$
consistent for $\gamma_0$. With this pilot estimator, we can then consider a neighborhood
of $\hat\gamma_I$ that contains $\gamma_0$ to define our resulting estimator. Let
$B_n = \{\gamma : \|\gamma - \hat\gamma_I\| \le Bn^{-1/2}\}$ where $B$ is some positive constant.

Theorem 4.1. Suppose that conditions C1-C6 hold. Let the solution of (3.8) be
$\gamma(\hat\theta)$. Assume further that $f(x) > c_0$ and $f_{\hat\gamma}(\hat\gamma^T x) > c_0$ for all $x$ in its
support. We have

$$\sqrt{n}(\gamma(\hat\theta) - \gamma_0) \xrightarrow{d} N\big(0,\ J(\theta_0) V^{-1} Q_1 V^{-1} J(\theta_0)^T / \alpha_0^2\big),$$

where
$Q_1 =: J(\theta_0)^T Q J(\theta_0) = E\{|\varepsilon|^2 g'(\beta_0^T X)^2 J(\theta_0)^T [X/v(X) - E(X/v(X)|\gamma_0^T X)][X/v(X) - E(X/v(X)|\gamma_0^T X)]^T J(\theta_0)\}$
and $V = E\{[g'(\beta_0^T X)]^2 J(\theta_0)^T X X^T / v(X)\, J(\theta_0)\}$.
Remark 4.1. The above result is obtained under the condition $f(x) > c_0$ and
$f_{\hat\gamma}(\hat\gamma^T x) > c_0$ for all $x$ in its support for simplicity. When these conditions do
not hold, $I_n(x_i) = I\{\hat f_{\hat\gamma}(\hat\gamma^T x_i) > c_0\}$ in (3.8) is not equal to 1, where,
recalling the definition in (3.13),
$\hat f_{\hat\gamma}(\hat\gamma^T x_i) = (nh^2)^{-1}\sum_{j=1}^n U_{nj}(\hat\gamma^T x_i; \hat\gamma, h)$ and
$\hat\gamma = \hat\beta_I/\|\hat\beta_I\|$. From the proof of Lemma 6.2, we can see that $\hat f(z)$ converges to a
function $\mu_2 f_{\gamma_0}(z)$, where $\mu_2$, to be specified in the Appendix, is related to the
kernel function in the local linear smoother. In place of $X/v(X)$ we should use
$XI(X)/v(X)$, where $I(X)$ is the limit of $I_n(X)$ as $n$ tends to infinity. The variance in
the theorem can be regarded as the limit as $c_0$ goes to zero at a certain rate. All
the results below are stated with $I_n(X) = 1$ for simplicity. Otherwise, the proofs are
similar with more tedious computation, and the results hold with

$$Q_1 = E\{|\varepsilon|^2 g'(\beta_0^T X)^2 J(\theta_0)^T [XI(X)/v(X) - E(XI(X)/v(X)|\gamma_0^T X)][XI(X)/v(X) - E(XI(X)/v(X)|\gamma_0^T X)]^T J(\theta_0)\} =: J(\theta_0)^T Q J(\theta_0)$$

and

$$V = E\{[g'(\beta_0^T X)]^2 J(\theta_0)^T X X^T I(X)/v(X)\, J(\theta_0)\}$$

substituted accordingly throughout the paper.

Remark 4.2. For the single-index model, Wang, Xue, Zhu and Chong (2010) and Chang, Xue and Zhu (2010) also used estimating equations. However, their approaches are based on least squares, and their estimating equations are the derivatives of the least squares criterion. Thus, although their estimation is asymptotically more efficient than existing methods such as Härdle, Hall and Ichimura (1993) and Xia and Härdle (2006), it cannot have the optimality properties that the quasi-likelihood can have. More importantly, when $v(X)$ has the structure $v(\beta_0^T X)$, which we discuss in Section 5, we usually cannot define a consistent estimator from the estimating equations derived from weighted least squares; see Heyde (1997, page 4). Also, in our setting, we can directly use the derivative $g'(\beta^T X)$ because $g'$ is known. This differs from the single-index model with unknown $g'$. The estimation procedure therefore involves less nonparametric smoothing.

To make a comparison with the one derived by the classical quasi-likelihood, we 
give a proposition. 



Proposition 4.1. Let $\hat\beta$ be the quasi-likelihood estimator of $\beta_0$, and let $\hat\gamma = \hat\beta/\|\hat\beta\|$ and $\hat\alpha = \|\hat\beta\|$. Then the limiting variance of $\sqrt{n}(\hat\alpha - \alpha_0)$ is $\gamma_0^T V^{-1}\gamma_0$, and the limiting variance of $\sqrt{n}(\hat\beta/\|\hat\beta\| - \gamma_0)$ is $(0_p, J(\theta_0))(A(\theta_0)^T V A(\theta_0))^{-1}(0_p, J(\theta_0))^T/\alpha_0^2$. This variance is greater than or equal to $J(\theta_0)V_1^{-1}Q_1V_1^{-1}J(\theta_0)^T/\alpha_0^2$ in Theorem 4.1. Here $V = E[g'^2(\beta_0^T X)XX^T/v(X)]$ and $0_p$ is a zero vector of dimension $p$.

Remark 4.3. This proposition also shows that when the norm of $\beta_0$ is one, that is, $\beta_0 = \gamma_0$, the limiting variance-covariance matrix is smaller than or equal to the quasi-Fisher information (or the Fisher information in the exponential distribution case). In general cases, this proposition helps us obtain the final estimator $\hat\alpha\gamma(\hat\theta)$ with smaller limiting variance as follows.

Theorem 4.2. Under the conditions of Theorem 4.1, when Procedure 1 is applied, we have

$$\sqrt{n}(\hat\alpha_I\gamma(\hat\theta) - \beta_0) \xrightarrow{D} N(0,\ A(\theta_0)WA(\theta_0)^T), \tag{4.1}$$

where

$$W = \begin{pmatrix} \gamma_0^T V^{-1}\gamma_0 & 0_{p-1}^T \\ 0_{p-1} & V_1^{-1}Q_1V_1^{-1} \end{pmatrix},$$

and $V_1$ and $V$ are respectively defined in Theorem 4.1 and Proposition 4.1.

For the estimator obtained by Procedure 2, we state the following theorem.

Theorem 4.3. Under the conditions of Theorem 4.1, when Procedure 2 is adopted, we have

$$\sqrt{n}(\hat\alpha_F\gamma(\hat\theta) - \beta_0) \xrightarrow{D} N(0,\ A(\theta_0)W_1A(\theta_0)^T), \tag{4.2}$$

where

$$W_1 = \begin{pmatrix} (\gamma_0^T V\gamma_0)^{-1} + \dfrac{\gamma_0^T VJ(\theta_0)V_1^{-1}Q_1V_1^{-1}J(\theta_0)^T V\gamma_0}{(\gamma_0^T V\gamma_0)^2} & 0_{p-1}^T \\ 0_{p-1} & V_1^{-1}Q_1V_1^{-1} \end{pmatrix},$$

and $W \ge W_1$.



The proof in the Appendix shows that the first diagonal element of $W$ is greater than or equal to the corresponding element of $W_1$, while the remaining blocks coincide. Hence $W \ge W_1$. This indicates that when we use the estimator $\gamma(\hat\theta)$ as a plug-in for estimating $\alpha_0$, the resulting estimation does benefit from its smaller variance. It also suggests that estimating $\alpha_0$ directly by the classical quasi-likelihood is, at least asymptotically, worse than the new method. We are now in a position to compare the matrices involved.

Theorem 4.4. Let $V_2 = A(\theta_0)^T V A(\theta_0)$. Then $V^{-1} = A(\theta_0)V_2^{-1}A(\theta_0)^T$. We have

1. All the elements on the diagonal of $W$ (or $W_1$) are smaller than or equal to the corresponding elements of $V_2^{-1}$.

2. $\mathrm{trace}(V^{-1}) \ge \mathrm{trace}(A(\theta_0)WA(\theta_0)^T)$ (or $\mathrm{trace}(A(\theta_0)W_1A(\theta_0)^T)$).

Remark 4.4. In the proof in the Appendix, we can see that when $J(\theta_0)^T X/v(X)$ is uncorrelated with $[g'(\beta_0^T X)]^2\gamma_0^T X$, $V_2^{-1} \ge W\ (= W_1)$. Further, when the stronger assumption $E(J(\theta_0)^T X/v(X)|\gamma_0^T X) = 0$ holds, we have $Q_1 = V_1$. Thus, $V_2^{-1} = W\ (= W_1)$ and $V^{-1} = A(\theta_0)WA(\theta_0)^T$. This shows that the classical quasi-likelihood can be asymptotically as efficient as the two-stage quasi-likelihood when $J(\theta_0)^T X$ is uncorrelated with $\gamma_0^T X$ in a certain sense. This is the case when $X$ follows a spherically symmetric distribution. Otherwise, the variance obtained by our method is smaller than that of the classical quasi-likelihood in the sense stated in the theorems. From these phenomena, we can see that the estimation efficiency of the new method may benefit from the correlations among the predictors in a certain sense.

5 Further discussions

As this is the first step towards a general idea of using semiparametric approaches to study parametric problems for the GLM, there are many potential issues worthy of further exploration.

1. In the previous sections, the conditional variance of $\varepsilon$ given $X$ is assumed to be known. In many GLMs, it has the structure $v(\beta_0^T X)$, where $v(\cdot)$ is given up to a dispersion constant and $\beta_0$ is the parameter of interest in the regression function. In this case, when we use the quasi-likelihood to search for a solution, the unknown $\beta_0$ should be either replaced by an initial estimator or estimated by the same quasi-likelihood. The estimating equation for the initial estimator $\hat\beta$ is

$$G(\beta) = \sum_{i=1}^n (y_i - g(\beta^T x_i))\,g'(\beta^T x_i)\,x_i/v(\beta^T x_i) = 0, \tag{5.1}$$

assuming that $v$ is given, and in Procedure 1, we change the estimating equation to obtain the estimator of $\theta_0$ by

$$SG(\gamma(\theta)) = \sum_{i=1}^n (y_i - \hat E(Y|\gamma(\theta)^T x_i))\,g'(\hat\alpha\gamma(\theta)^T x_i)\,J(\theta)^T x_i/v(\hat\alpha\gamma(\theta)^T x_i)\,I_n(x_i) = 0, \tag{5.2}$$

or simply use $\hat\beta$ in $v(\cdot)$. From the motivation in Section 2 and the proof of Theorem 4.1, it is easy to see that the estimator of $\beta_0$ obtained by the modified Procedure 1 with (5.2) is asymptotically not affected by any consistent estimator $\hat\alpha\gamma(\hat\theta)$ (or $\hat\beta$) in $v(\hat\alpha\gamma(\hat\theta)^T\cdot)$ (or $v(\hat\beta^T\cdot)$), and similarly for Procedure 2 when we use

$$G(\alpha) = \sum_{i=1}^n (y_i - g(\alpha\gamma(\hat\theta)^T x_i))\,g'(\alpha\gamma(\hat\theta)^T x_i)\,\gamma(\hat\theta)^T x_i/v(\alpha\gamma(\hat\theta)^T x_i) = 0. \tag{5.3}$$

Then the asymptotic properties are very similar to those in the previous theorems, as if $v(\cdot)$ were given. Actually, the above can be more general, with a general function $w(\beta^T X)$ in lieu of $g'(\beta^T X)/v(\beta^T X)$, and the technical proof is very similar.

Further, when $v(X)$ has a single-index structure $v(\alpha_0\gamma_0^T X)$ with unknown function $v(\cdot)$, we can apply nonparametric smoothing to the squared residuals $(y_i - g(\hat\beta^T x_i))^2$ to construct, for any fixed $\gamma$, an estimator $\hat v(\hat\alpha\gamma^T X)$. This is possible because this function is the conditional expectation of $\varepsilon^2$ given $\gamma^T X$. Here $\hat\beta$ is an initial estimator obtained by the classical quasi-likelihood as

$$G(\beta) = \sum_{i=1}^n (y_i - g(\beta^T x_i))\,g'(\beta^T x_i)\,x_i = 0. \tag{5.4}$$

As is well known, this $\hat\beta$ is not asymptotically efficient, but it is still asymptotically normal, and $\hat v(\cdot)$ has a nonparametric convergence rate (see Fan and Gijbels 1996). This helps to define our final estimator of $\beta_0$. In Procedure 1, we change the estimating equation to obtain the estimator of $\theta_0$ by

$$SG(\gamma(\theta)) = \sum_{i=1}^n (y_i - \hat E(Y|\gamma(\theta)^T x_i))\,g'(\hat\alpha\gamma(\theta)^T x_i)\,J(\theta)^T x_i/\hat v(\hat\alpha\gamma(\theta)^T x_i)\,I_n(x_i) = 0, \tag{5.5}$$

or in (5.5) we simply use $\hat\beta$ in lieu of $\hat\alpha\gamma(\theta)$ in the estimated variance function $\hat v(\hat\alpha\gamma(\theta)^T\cdot)$. In Procedure 2, we use

$$G(\alpha) = \sum_{i=1}^n (y_i - g(\alpha\gamma(\hat\theta)^T x_i))\,g'(\alpha\gamma(\hat\theta)^T x_i)\,\gamma(\hat\theta)^T x_i/\hat v(\alpha\gamma(\hat\theta)^T x_i) = 0. \tag{5.6}$$

Again we can derive results similar to those in Section 2.

In the above two cases, with given and unknown function $v(\cdot)$, the matrix $Q_1$ has a simpler structure than its form in Theorem 4.1, because $v$ is a function of $\beta_0^T X$:

$$Q_1 =: J(\theta_0)^T Q J(\theta_0) = E\{g'(\beta_0^T X)^2/v(\beta_0^T X)\, J(\theta_0)^T [X - E(X|\gamma_0^T X)][X - E(X|\gamma_0^T X)]^T J(\theta_0)\}.$$

2. The limiting variances depend on the re-parametrization. Thus, it is also of interest to explore whether there is a re-parametrization under which the variances attain the minimum among all possible re-parametrizations. Note that our results only need the orthogonality between $\gamma$ and its Jacobian matrix. Thus any re-parametrization with this property can be used to derive similar results.

3. A more fundamental issue concerns asymptotic efficiency in the new framework we consider. For the GLM, we introduce a nuisance parameter and regard the link function as its true value. The purpose is to introduce automatically an extra "estimation" of $g(\cdot)$ at the true value $\beta_0$. When this "estimation" is non-positively correlated with the one that can achieve the Fisher information, the resulting estimation has a chance to be more efficient. This idea is general in principle. Hence, regardless of re-parametrization, could we have an optimal selection of the extra "estimation" to achieve the smallest possible variance? We conjecture that the variance we obtain by Procedure 2 might have some optimality property. More generally, for an estimator $T_n$ of the parameter of interest $\beta_0$, if we were able to define an extra "estimation" $T_{n1}(\beta_0)$ at the true value of $\beta_0$ with non-positive correlation with $T_n - \beta_0$, the sum $(T_n - \beta_0) + T_{n1}(\beta_0)$ may have a smaller variance than that of $T_n - \beta_0$ when twice the covariance between them is, in absolute value, greater than the variance of $T_{n1}(\beta_0)$. As such, the key is how to find such an extra "estimation". Our parameter space augmentation approach shows that this is possible, at least for the GLM, when we "intentionally" introduce a nuisance parameter into the model.



4. In this paper, we consider the independent and identically distributed case. The approach may be readily extended to handle, say, cluster data and correlated data.

5. The advantage of the new method is its asymptotic efficiency. Compared with classical estimation, however, its computational cost is higher. This is because it involves one-dimensional nonparametric estimation, so we may face computational inefficiency and instability in tuning parameter selection that the classical quasi-likelihood avoids. This deserves further study for finite sample implementation.
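The variance comparison described in item 3 can be written out in one line. In the scalar case, write $A = T_n - \beta_0$ and $B = T_{n1}(\beta_0)$; the symbols $A$ and $B$ are shorthand introduced here only for this illustration.

```latex
\begin{align*}
\operatorname{Var}(A+B) &= \operatorname{Var}(A) + \operatorname{Var}(B) + 2\operatorname{Cov}(A,B),\\
\operatorname{Var}(A+B) < \operatorname{Var}(A)
  &\iff 2\operatorname{Cov}(A,B) < -\operatorname{Var}(B)\\
  &\iff \operatorname{Cov}(A,B) < 0 \ \text{ and } \ 2\,|\operatorname{Cov}(A,B)| > \operatorname{Var}(B).
\end{align*}
```

So a negative covariance alone is not enough: the extra term reduces the variance only when the covariance is large enough in absolute value relative to the variance that $B$ itself contributes.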



6 Appendix 



6.1 Conditions 

To obtain the asymptotic behavior of the estimators, we first give the following conditions for technical purposes:

C1. (i) The distribution of $X$ has a compact support set $A$.

(ii) The density function $f_\gamma(\cdot)$ of $\gamma^T X$ satisfies a Lipschitz condition of order 1 for $\gamma$ in a neighborhood of $\gamma_0$. Further, $\gamma_0^T X$ has a bounded density function $f_{\gamma_0}(\cdot)$ on its support $T$.

C2. (i) The function $g(\gamma^T X) = E(Y|\gamma^T X)$ has two bounded and continuous derivatives.

(ii) Let $l(z) = E(X/v(X)|\gamma_0^T X = z)$, and let $l_s(\cdot)$ be the $s$-th component of $l(\cdot)$, $1 \le s \le p$. Each $l_s(\cdot)$ satisfies a Lipschitz condition of order 1.

(iii) $\mathrm{Cov}(X/v(X)|\gamma_0^T X = z)$ exists, and $\max_{1\le s\le p} E((X_s/v(X))^2|\gamma_0^T X = z) \le c$.

C3. (i) The kernel $K$ is a bounded, continuous and symmetric probability density function, satisfying

/OO /"OO 

u^K{u)du 7^ 0, / \u\'^K{u)du < 00; 
00 J —00 

(ii) K satisfies Lipschitz conditions on R^. 

C4. E{e\X) = 0,var(e|X) = v{X) < oo,E{e^\X) < 00. 

C5. $h \to 0$, $nh^2/\log^2 n \to \infty$, $\limsup_{n\to\infty} nh^5 \le c < \infty$.

C6. Both $V = E[g'^2(\beta_0^T X)XX^T/v(X)]$ and $V_1 =: J(\theta_0)^T V J(\theta_0)$ are positive definite matrices, where $J(\theta_0)$ is the Jacobian matrix $\partial\gamma/\partial\theta$ evaluated at $\theta = \theta_0$.

Remark 6.1. The Lipschitz condition and the derivatives in C1 and C2 are standard smoothness conditions. It is worth noticing that, unlike other references, we do not bound the density function of $X^T\beta_0$ away from zero. This is because when we use a truncation constant $c_0$, we can ensure that the truncated density, and hence the denominators of $\hat g(z;\gamma_0,h)$ and $\hat g'(z;\gamma_0,h)$, are bounded away from zero. See Xia and Härdle (2006) for reference. C3 means that the kernel used for smoothing is of second order. C4 is a standard condition on the error term. C6 ensures that the limiting variance of the estimator $\gamma(\hat\theta)$ exists.



6.2 Proofs 



As the following lemmas are very similar to those of Wang, Xue, Zhu and Chong (2010) and Chang, Xue and Zhu (2010), we only present the main steps of the proofs. A relevant reference is Zhu and Xue (2006). The main differences from these references lie in the estimating equations and the estimators involved.

Lemma 6.1. (Wang, Xue, Zhu and Chong 2010). Let $\xi_1(x,\gamma), \ldots, \xi_n(x,\gamma)$ be a sequence of random variables. Denote $f_{x,\gamma}(V_i) = \xi_i(x,\gamma)$ for $i = 1, \ldots, n$, where $V_1, \ldots, V_n$ is a sequence of random variables and $f_{x,\gamma}$ is a function on $\mathcal{A}_n$, where $\mathcal{A}_n = \{(x,\gamma) : (x,\gamma) \in A \times R^p,\ \|\gamma - \gamma_0\| \le cn^{-1/2}\}$ and $c > 0$ is a constant. Assume that $f_{x,\gamma}$ satisfies

$$\frac{1}{n}\sum_{i=1}^n |f_{x,\gamma}(V_i) - f_{x^*,\gamma^*}(V_i)| \le cn^{a}\big(\|\gamma - \gamma^*\| + \|x - x^*\|\big) \tag{A.1}$$

for some constants $x^*$, $\gamma^*$, $a > 0$ and $c > 0$. Let $\varepsilon_n > 0$ depend only on $n$. If



1 " 
-1^6(^,7) 




(A.2) 



for (x,7) G An, then we have 



P < sup 



1 " 

-y'6(a;,7) 
n ^-^ 



> -Er 



< c,n^^'^e~'^E{ sup 2exp( T^'^^ ^^ ) A 1 



(A.3) 



where ci> Q is a constant. 

Lemma 6.2. Suppose that conditions C1, C2 and C3(i) hold. If $h = cn^{-a}$ for any $0 < a < 1/2$ and some constant $c > 0$, then for $i = 1, \ldots, n$, we have

n 

EW^x,)- Y, Wnii^^x,-^o,h)gif3^Xi)]' = 0{h^), (A.4) 



n 



E[g{ao-f^x)- ^ iy„,(7^x; 7, %(«o7^x,)]2 = 0(/i^), (A.5) 



i=l,i^j 



E 



Yl ^nji^o^i] -fo)g'{/3oXi)ls{joXi) - g'{P^Xj)h{-foXj) 



0{Vh), 

(A.6) 



where Ig is defined in Condition C2. 



Proof. The basic arguments of the proof are from Wang, Xue, Zhu and Chong (2010), and the proof is very similar to that of Lemma 5.2 in Chang, Xue and Zhu (2010). We therefore provide only the main steps. We prove the first conclusion; the others can be proven similarly. Denote $z_i = \gamma_0^T x_i$ and $\tilde z_i = \beta_0^T x_i$, and write $S_{n,r}(z_i)$ and $U_{nj}(z_i)$ for $S_{n,r}(\gamma_0^T x_i;\gamma_0,h)$ and $U_{nj}(\gamma_0^T x_i;\gamma_0,h)$ respectively, for simplicity. We have

$$g(\tilde z_j) - \sum_{i=1}^n W_{ni}(z_j)\,g(\tilde z_i) = \frac{\sum_{i=1}^n U_{ni}(z_j)\,(g(\tilde z_j) - g(\tilde z_i))}{\sum_{i=1}^n U_{ni}(z_j)}. \tag{A.7}$$

Let $E_{z_i}[\cdot]$ denote the conditional expectation given $z_i$. Denote $T_n = O_r(a_n)$ if $E|T_n|^r = O(a_n^r)$. Using the Cauchy-Schwarz inequality, we can easily obtain

$$O_r(a_n)\,O_r(b_n) = O_{r/2}(a_nb_n), \tag{A.8}$$

$$T_n = E_{z_i}[T_n] + O_r\big((E|T_n - E_{z_i}[T_n]|^r)^{1/r}\big). \tag{A.9}$$

From condition C1(ii) we know that $M_{f_{\gamma_0}} = \sup_t f_{\gamma_0}(t) < \infty$, and that there exists a constant $L > 0$ such that for any real $y$ and $t$, $|f_{\gamma_0}(y) - f_{\gamma_0}(t)| \le L|y - t|$. Using these facts, when $n$ is large enough, we obtain that for $i = 1, \ldots, n$,

$$E_{z_i}[S_{n,r}(z_i)] = h^r\mu_rf_{\gamma_0}(z_i)(1 + O(h)), \quad r = 0, 1, 2, \tag{A.10}$$

where $\mu_r = \int u^rK(u)\,du$, $r = 0, 1, 2$, $\mu_0 = 1$ and $\mu_1 = 0$.
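The kernel moments $\mu_r$ that appear in (A.10) are easy to evaluate for a concrete kernel. The sketch below uses the Epanechnikov kernel, which is an assumption made here for illustration (the paper only requires a bounded symmetric density), and a midpoint rule; `epanechnikov` and `kernel_moment` are hypothetical helper names.

```python
import numpy as np

def epanechnikov(u):
    # K(u) = 0.75 * (1 - u^2) on [-1, 1], zero elsewhere (assumed kernel).
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kernel_moment(kernel, r, lo=-1.0, hi=1.0, m=200_000):
    """Midpoint-rule approximation of mu_r = integral of u^r K(u) du over [lo, hi]."""
    du = (hi - lo) / m
    u = lo + (np.arange(m) + 0.5) * du
    return float(np.sum(u**r * kernel(u)) * du)

mu = [kernel_moment(epanechnikov, r) for r in range(3)]
```

For this kernel the moments are $\mu_0 = 1$, $\mu_1 = 0$ and $\mu_2 = 1/5$, in line with the normalization $\mu_0 = 1$, $\mu_1 = 0$ stated after (A.10); $\mu_2$ is the constant that enters the limit $\mu_2 f_{\gamma_0}^2(z)$ in Remark 4.1.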

By the inequality of sum of independent random variables(see Petrov, 1995), we 
obtain that 

E{\Sn,riZi) - E^^Sn,r{Zi)\^} = E{E^^\Sn,riZi) - E^^Sn,r{Zi)\^} 

/oo 
\u'^K{u)\du{Ef^,Xzi) + Lh) 
'OO 

/oo 
u'^'-K\u)du{f^^{zi) + Lh)] 
-oo 

$$= O(n^{-3}h^{4r-3}) + O(n^{-2}h^{4r-2}). \tag{A.11}$$

This together with (A.9) and (A.10) proves that

$$S_{n,r}(z_i) = E_{z_i}[S_{n,r}(z_i)] + O_4\big(h^r(nh)^{-1/2}\big) = \mu_rh^rf_{\gamma_0}(z_i)\big(1 + O_4(h + (nh)^{-1/2})\big). \tag{A.12}$$

By (A.8) and (A.12), we have

$$\frac{1}{n}\sum_{i=1}^n U_{ni}(z_j) = S_{n,0}(z_j)S_{n,2}(z_j) - S_{n,1}^2(z_j) = \mu_2h^2f_{\gamma_0}^2(z_j)\big(1 + O_2(h + (nh)^{-1/2})\big). \tag{A.13}$$


Let $U_n(z) = (nh^2)^{-1}\sum_{i=1}^n U_{ni}(z)$ and $U(z) = \mu_2f_{\gamma_0}^2(z)$. Using Lemma 6.1, (A.13) and the Borel-Cantelli lemma, we can prove [refer to Wang, Xue, Zhu and Chong (2010)]

$$\sup_{z\in T}|U_n(z) - U(z)| \to 0, \quad a.s. \tag{A.14}$$

From condition C1 we know that $\inf_{z\in T}f_{\gamma_0}(z) > c > 0$. Thus, when $n$ is large enough,

$$\inf_{z\in T}|U_n(z)| \ge \inf_{z\in T}|U(z)| - \sup_{z\in T}|U_n(z) - U(z)| \ge \mu_2c^2/2 > 0, \quad a.s. \tag{A.15}$$

Write $H(\tilde z_i,\tilde z_j) = g(\tilde z_j) - g(\tilde z_i) + g'(\tilde z_j)(z_i - z_j)\alpha_0$. Noting that $\sum_{i=1}^n U_{ni}(z_j)(z_i - z_j) = 0$, we have

n n 

^Uni{zj){g{Sj) - g{Si)) = ^H{Si,Sj)Uni{zj) 

i=l i=l 

n 

= ^ H{Zi, Zj)Kh{Zi - Zj)Sn,2{Zj) 

i=l 

n 

-^H{zu Zj){zi - Zj)Kh{zi - Zj)Sn,i{zj). (A. 16) 

i=l 

Similar as the proof of (A.12) for r = 0, 1, we have 

1 " 
■^j^ Yl ^(^^' ^j)(^^ ~ ZjY'Khiz^ - Zj) 

i=l 

= h~^~^Ez^[H{2i,Zj)izi - zjYKhizi - z,)] + 04(1) 

=: d„, + 04(l), 2<j<n, (A.17) 

and \dnr\ < c J \u\^+'^K{u)du{f^,{Zj) + 0(h)) = 0(1). This together with (A.12), 
(A. 16) and (A.17) proves 

n 

^ Uniizj){g{zj) - g{zi)) = 7^/1^2/70 (^j)c?„o + 02(^/1^). 
1=1 




Thus, combining again (A. 15), we conclude 



E 



'ZLiUni{zj){g{zj) - g{zi)) 



L^i=\ '-^nii^j) 



< c{nh')- E 



^Uni{zj){g{Sj) - g{Si)) 



i=l 



Oih" 



(A. 4) now follows from (A. 7) and (A. 17). 

Lemma 6.3. Under the conditions of Lemma 6.2, we have 



E\Y,Wl{^''x-^M=0{{nh)-') 



i=l 



E {Er=i.^, wii^i^f 70, h)] = oiinhr^) ■ 



(A.18) 
# 

(A.19) 
(A.20) 



Proof. The proof of Lemma 6.3 is similar to that of Lemma 6.2; hence, we omit it. □



Lemma 6.4. Suppose that conditions C1-C4 and C5(i) hold. We then have

$$\sup_{(x,\gamma)\in\mathcal{A}_n}|g(\beta_0^T x) - \hat g(\gamma^T x;\gamma,h)| = O_p\big((nh/\log n)^{-1/2}\big). \tag{A.21}$$



Proof Write g{xi, £j) = giP^x) - gil^lxi) - e^, z = 1, . . . , ra. We have 

n 

g{l3'^x) - g{-i^x- 7, /i) = J] W^il'^x] 7, h)~g{xi, Ei). (A.22) 



j=i 



Let ^i(x,7) = n{nh/ lognf/'^Wniil'^ x;'^,h)g{xi,ei), fx,^{Vi) =^^(^,7), Vi = {xi,ei), 
i = l,...,n. Now we test and verify (A.l) and (A. 2) in Lemma 6.1. A simple 
calculation yields (A.l). For (A. 2), by Lemma 6.2 and Lemma 6.3(A.19) and noting 
that sup(^_^)g^^ \g{ao-f'^x) - g{x^l3o)\ = 0{n-^/^), 7 = /3/||/3||, we have 

E[gi(3^x)-g{j''x;j,h)f 

n 

= E[^Wni{j'^x;'j,h)g{xi,ei)f 
1=1 

n 

< cE[g{P^x) - J2 Wniil^x- 7, h)g{f]^Xi)]' 
1=1 



+cE[J2w^,{^^x-^,h)]+0{n-'] 



i=l 



< ch'^ + c{nhy\ 



(A.23) 
Given an $M > 0$, by Chebyshev's inequality and (A.23), we have



1 " 
n ^-^ 

i=l 



>-M\< AM-^E 



1 " 
n ^-^ 

i=l 



< 4M-'^nh{\ogny^E 



-\ 2 



Y^ Wniix^'^; 7, h)g{Xi, e^ 



i=l 



< cM-^{cnh^ + c(logn)"^). 



(A.24) 



Therefore, from condition C5, we can choose $M$ large enough so that the right-hand side of (A.24) is less than or equal to $1/2$. Hence, (A.2) is satisfied. We can now use (A.3) of Lemma 6.1 to get (A.21). By (A.19), we obtain that



n 



Y,E^ii^,l) = n/i(logn)-^^E[iy„,(7'^x;7,/i)^(x„£,)]^ 



i=l 



n 



< cnh{\ognr'J2^^n^il''^■,l,h) 



i=l 



< c(logn) ^. 



This implies that n "^ Yll=i ^i {^ y l) = Op((logn) ^). Hence, from Lemma 6.1 we 
have 



P { sup 



1 " 

-5Ze.(a:,7) 
1=1 



>-M\< cn^P'^M-^P exp{-cM'^ log n] 



The right-hand side of the above formula tends to zero when $M$ is large enough. Therefore, (A.21) is shown. □



Lemma 6.5. Suppose that conditions C1-C6 are satisfied. Then we have

$$\sup_{\theta\in B^*}\|R(\theta) - U(\theta_0) + nV_1(\theta - \theta_0)\| = o_p(\sqrt{n}), \tag{A.25}$$

where $B^* = \{\theta : \|\theta - \theta_0\| \le cn^{-1/2}\}$ for a constant $c > 0$, $V_1$ is defined in condition C6, and

$$R(\theta) = \sum_{i=1}^n [y_i - \hat g(\gamma(\theta)^T x_i;\theta,h)]\,g'(\hat\alpha\gamma(\theta)^T x_i)\,J(\theta)^T x_i/v(x_i), \tag{A.26}$$

$$U(\theta_0) = \sum_{i=1}^n \varepsilon_i\,g'(\beta_0^T x_i)\,J(\theta_0)^T[x_i/v(x_i) - E(X/v(X)|\gamma_0^T x_i)]. \tag{A.27}$$

As in our notation in Section 3, $\beta_0 = \alpha_0\gamma_0$, $\gamma_0 = \gamma(\theta_0)$, and $\hat\alpha$ is a consistent estimator of $\alpha_0$ from Section 3.

Proof. Separating $R(\theta)$, we have

n 

R{e) = Y,^^9'{Po^^)J^mx.,/v{x,)-E{x,/v{x,)\-f^x,)] 
1=1 

n 

+ ^ei[g'{a'-f{eYxi) - g'{ao'-iQXi)\j'^{0)xi/v{xi) 

i=l 
n 

- Y,9\l^o^i)J^{0)x,/v{x{)[g{^{9fx,-9,h) - g{^lx,-^oM 

i=l 

n 

- ^Y^9'{filxi).f{Q){xilv{xi)\g{-ilxi\-iQ,K) - g{filxi)\ -ei/(7(fxi)} 



n 



- 5^[^(7(^)'^a;,; 0, h) - g{l3]^Xi)][g'{a^{eYx,) - g'{a,^'^Xi)]J^{e)xJv{xi) 

i=l 

=: R^{e) + R^{e) - Rs{9) - R,{e) - R,{e), (A.28) 

where [{'y'^X) = E{X/V{X)\'~fQX) which is defined in C2. Ri{9) can be written as 

n 

RM = Y,''^9\aol{0oYx,)J{e,f[xJv{x;)-E{X/v{X)\^'^x,)] 

i=l 
n 

+ Y,e,g'{ao^{9ofx,){J{e) - J{eo)f[x,/v{x,) - E{X/v{X)\^'^x,)] 

i=l 

Note that \\J{6) - 7(^0)11 = Opin-^'"^), for all 6 e B\ then by the law of large 
numbers, we have 

sup ||i?i(^) - f/(0o)|| = op(v^). (A.29) 



For R2{e), as hW -7(^o)|| = Op(n-V2), ||d7(^) - ao7(^o)|| = 0^(^-1/2), for 
all 9 e B* . By the Taylor expansion of g'{a'y{9)'^Xi) — g'{ao'y{9o)'^Xi), and the law 
of large numbers, we have 

sup||i?2(^)|| =Op(v^) (A.30) 

0€B* 



By Taylor expansion of ^, we have for R3{9) 

n 

i=l 

n 

= $^^'(ao7(^o)^a;,)J(eo)^x,/^;(x,)xf[^'(7(^o)^x,;^o,/i)(7W -7(^o))] +Op(v^) 
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w\iemg{^{dofxi-d^,h) = E{Y\^{dQYx,). Known that E'(F|7(^o)'^a:i) A E'{Y\^{eQfxi), 



and actually E'(F|7(6'o)^Xi) = aSS^TT I'rC^o)^^;-^ = ao5''(«o7(^o)^a;i). So, we obtain 



that 

n 



i=l 



by Weak Law of Large Numbers, we can derive 

svip\\Rs{e)-nV,ie-eo)\\ = o,{V^) (A.31) 

where Vi is defined in C6. 

For Ri{9), using the similar method as in Lemma A. 5. of Chang, Xue and 
Zhu(2010), using our Lemma 6.2 and Lemma 6.3, write R^iO) = J^{6)Rl{6). Let 
R^g denote the sth component of Rl{6), we can derive that 

-E {R*J) = -E (J29'il3oXi){^^s/v{x,MJ^Xi) - giP^Xi)] - eMloX^)^■m 

(n n ^ 

^g'{f3oXi)xis/v{xi)[^Wnj{-foXi)g{f3QXj) - g{(3QXi)] 

2 



n n 



+ cn ^E i ^g'{(3QXi)[xis/v{xi) ^ Wnj{'yoXj)ej - ei/,(7(fxi)] 



IZT'/'D* ^2 I ^^-It-i/d* ^2 



=: cn-^E{Rl^y + cn~^E{K] 



As2) 



E{RUf < c^ Eg'\p^x,)^ W^,{^];x,)g{P^x,) - g{P^Xi)fE{{x,Jv{xi)f\^l 
1=1 \ j=i 

by C(iii) and Lemma6.2 , we have 

E{Rlsif < cnh^ (A.33) 

For Rl, 

n / n 

n / n 

< cJ^eI v{x,){Y,9'{Po^j)Wm{lo^j)ls{lo^j) - g\Plxi)h{^^x,)f 

i=i V j=i 

n / n 

+ ^^^ [ '"i^i)C^9'{/3oXj)Wni{-foXj){XjJv{Xj) - ls{l'^Xj))f 
i=l \ j=l 
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T 

Xi 



BY Lemma 6.2, C(iii) and Lemma 6.3, we can obtain 

EiRl,2? < cnVh + cn{nhy^ (A.34) 

Together with (A.33), (A.33) and (A.34) we get 

-E {R*J) < ch^ + cVh + c{nhy^ -^ 0. 



This imphes 



sup ||i?4(^)|| =Op(v^). (A.35) 

6»gB* 



For R^iO), by Taylor expansion to g\-), together with || 67(6') — 007(6*0) || = 
Op(n"^/^), for all 6 e B* , we have 

n 
i=l 

By Lemma 6.4 and Condition C5, we obtain 

sup ||i?5(/3^'^)|| =Op(v^)- (A.36) 

/3('-)g/3* 

Together with (A.29)-(A.36) we conclude the proof of Lemma 6.5. □

Remark 6.2. When we consider (3.8) with In{xi), as \\In{xi) — In{xi)\\ — )■ can be 
proved, we can first substitute In{xi) with In{xi), and then use similar arguments as 
above, which will lead to Remark 4-1 right below Theorem 4-i- 



Proof of Theorem 4.1. The existence of the estimator is easy to prove. As the estimating equation is similar to the one in Chang, Xue and Zhu (2010), the arguments are also very similar, and we omit the proof here.

We now show the asymptotic normality. $\gamma(\hat\theta)$ is the solution of (3.8), that is, $R(\hat\theta) = 0$. By Lemma 6.5, we have

$$R(\hat\theta) = U(\theta_0) - nV_1(\hat\theta - \theta_0) + o_p(\sqrt{n}) = 0,$$

and hence

$$\sqrt{n}(\hat\theta - \theta_0) = V_1^{-1}n^{-1/2}U(\theta_0) + o_p(1).$$

It follows from (3.3) and (3.5) that

$$\gamma(\hat\theta) - \gamma(\theta_0) = J(\theta_0)(\hat\theta - \theta_0) + o_p(n^{-1/2}).$$

Thus, we have

$$\sqrt{n}(\gamma(\hat\theta) - \gamma(\theta_0)) = J(\theta_0)V_1^{-1}n^{-1/2}U(\theta_0) + o_p(1).$$

Theorem 4.1 now follows from the central limit theorem and Slutsky's theorem. □

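Proof of Proposition 4.1. By the definition of $\hat\alpha_I$, it is easy to see that

$$\sqrt{n}(\hat\alpha_I - \alpha_0) = \sqrt{n}(\|\hat\beta_I\| - \|\beta_0\|) = \gamma_0^T\sqrt{n}(\hat\beta_I - \beta_0) + o_p(1).$$

Then it is clear that the limiting variance is $\gamma_0^T V^{-1}\gamma_0$. For $\hat\gamma = \hat\beta_I/\|\hat\beta_I\|$, some elementary calculation yields that

$$\hat\gamma - \gamma_0 = \frac{\hat\beta_I}{\|\hat\beta_I\|} - \frac{\beta_0}{\|\beta_0\|} = \frac{\hat\beta_I\|\beta_0\| - \beta_0\|\hat\beta_I\|}{\|\hat\beta_I\|\,\|\beta_0\|} = (I_p - \gamma_0\gamma_0^T)(\hat\beta_I - \beta_0)/\|\beta_0\| + o_p(1/\sqrt{n}).$$

From this representation, we have the limiting variance

$$(I_p - \gamma_0\gamma_0^T)V^{-1}(I_p - \gamma_0\gamma_0^T)/\alpha_0^2.$$

Note that $V^{-1} = A(\theta_0)(A(\theta_0)^T VA(\theta_0))^{-1}A(\theta_0)^T$ and $(I_p - \gamma_0\gamma_0^T)A(\theta_0) = (0_p, J(\theta_0))$. The result follows. □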
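The matrix identities used in this proof can be checked numerically under one concrete re-parametrization. The sketch below assumes $\gamma(\theta) = (\sqrt{1 - \|\theta\|^2},\ \theta^T)^T$ with $A(\theta) = (\gamma(\theta), J(\theta))$; this specific form is an assumption made here for illustration (the definition in Section 3 is not reproduced in this part of the paper), and `gamma_of` and `jacobian_of` are hypothetical names.

```python
import numpy as np

def gamma_of(theta):
    """Unit vector gamma(theta) = (sqrt(1 - |theta|^2), theta')' (assumed form)."""
    theta = np.asarray(theta, dtype=float)
    return np.concatenate(([np.sqrt(1.0 - theta @ theta)], theta))

def jacobian_of(theta):
    """Jacobian d gamma / d theta, a p x (p - 1) matrix."""
    theta = np.asarray(theta, dtype=float)
    first_row = -theta / np.sqrt(1.0 - theta @ theta)
    return np.vstack((first_row, np.eye(theta.size)))

theta0 = np.array([0.3, -0.4, 0.2])
gamma0 = gamma_of(theta0)
J0 = jacobian_of(theta0)
A0 = np.column_stack((gamma0, J0))            # A(theta_0) = (gamma_0, J(theta_0))

orthogonality = gamma0 @ J0                    # gamma_0' J(theta_0), should vanish
projector = np.eye(gamma0.size) - np.outer(gamma0, gamma0)
projected_A = projector @ A0                   # should equal (0_p, J(theta_0))
```

Under this parametrization $\gamma_0^T J(\theta_0) = 0$, which is exactly the orthogonality property the results rely on, and it gives $(I_p - \gamma_0\gamma_0^T)A(\theta_0) = (0_p, J(\theta_0))$ as well as $A(\theta_0)^T\gamma_0 = (1, 0_{p-1}^T)^T$.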

Proof of Theorem 4.2. As $\hat\alpha_I$ and $\gamma(\hat\theta)$ converge respectively to $\alpha_0$ and $\gamma_0$, it is easy to see that $\hat\alpha_I\gamma(\hat\theta) - \beta_0 = (\hat\alpha_I - \alpha_0)\gamma_0 + \alpha_0(\gamma(\hat\theta) - \gamma_0) + o_p(1/\sqrt{n})$. From the standard proof of the asymptotic representation of $\hat\beta_I$ by the quasi-likelihood, we have the following result. Noting that $\hat\alpha_I = \|\hat\beta_I\|$ and $\alpha_0 = \|\beta_0\|$, and together with the proof of Proposition 4.1, the estimator by the quasi-likelihood satisfies

$$\sqrt{n}(\|\hat\beta_I\| - \|\beta_0\|) = \gamma_0^T\sqrt{n}(\hat\beta_I - \beta_0) + o_p(1) = \gamma_0^T V^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i\,g'(\beta_0^T x_i)\,x_i/v(x_i) + o_p(1). \tag{A.37}$$

Then an application of the central limit theorem shows that it converges in distribution to a normal distribution with mean zero and variance $\gamma_0^T V^{-1}\gamma_0$. Further, from the proof of Theorem 4.1, we have

$$\sqrt{n}(\gamma(\hat\theta) - \gamma_0) = J(\theta_0)V_1^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i\,g'(\beta_0^T x_i)\,J(\theta_0)^T[x_i/v(x_i) - E(X/v(X)|\gamma_0^T x_i)] + o_p(1). \tag{A.38}$$

Note that $A(\theta_0)^T\gamma_0 = (1, 0_{p-1}^T)^T$ and $A(\theta_0)^T J(\theta_0) = (0_{p-1}, J(\theta_0)^T J(\theta_0))^T$. Thus, our estimator $\hat\beta = \hat\alpha_I\gamma(\hat\theta)$ has

^0 - /3o) = V^{A{e,f)-^A{e,fCP - Po) 

n 

+Op(l). (A.39) 

Central Limit Theorems imply the asymptotic normality with mean zero and variance- 
covariance matrix 






= A{9o)WA{9of. 

The last equation holds because (A(9of)-^ (J ,,. ,Vi,. xx I = (A(9ofy^A(9oVA(9o). 

\ Up_i J[t>o)) J[(yo>) J 

The conclusion is proved. q 

Proof of Theorem 4.3. From the proof of Theorem 4.2, we only need to prove that $(\gamma_0^T V\gamma_0)^{-1} + \gamma_0^T VJ(\theta_0)V_1^{-1}Q_1V_1^{-1}J(\theta_0)^T V\gamma_0/(\gamma_0^T V\gamma_0)^2$ is the limiting variance of the estimator $\hat\alpha_F$. From (2.2), we can derive that



1 " 



Xi/w(Xi 



7ifV^v^(«F7W-/3o) + Op(l) 

loV'joVn{aF - «o) + ao7ir^AA(7(^) - 7o) + Op{l) 



Together with (A. 38), we have 

1oV'yo^/n{dF - ao) 
1 " 

-1= X] ^i9\f3oXih{9o)'^Xi/v{Xi) 



n . 
1=1 



r^VJ{9o)V-'^f2e,g'i(3^x,)J{9o 



7o' yj(^o)V^-i_ > e,g'{/3ix,)J{9oy [xi/v{x,) - E{X/v{X)\^^Xi] 



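The right-hand side has limiting variance

$$\gamma_0^T V\gamma_0 + \gamma_0^T VJ(\theta_0)V_1^{-1}Q_1V_1^{-1}J(\theta_0)^T V\gamma_0 - 2\gamma_0^T VJ(\theta_0)V_1^{-1}J(\theta_0)^T Q_2\gamma_0,$$

which is therefore the limiting variance of $\gamma_0^T V\gamma_0\sqrt{n}(\hat\alpha_F - \alpha_0)$, where $Q_2 = E\big(g'^2(\beta_0^T X)(X/v(X) - E(X/v(X)|\gamma_0^T X))X^T\big)$. It is easy to prove that $\gamma_0^T VJ(\theta_0)V_1^{-1}J(\theta_0)^T Q_2\gamma_0 = 0$. The proof is completed. □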

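Proof of Theorem 4.4. From the structure of $V$ it is easy to see that $V_2$ can be written as

$$V_2 = A(\theta_0)^T VA(\theta_0) = \begin{pmatrix} \gamma_0^T V\gamma_0 & \gamma_0^T VJ(\theta_0) \\ J(\theta_0)^T V\gamma_0 & J(\theta_0)^T VJ(\theta_0) \end{pmatrix} =: \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_1 \end{pmatrix}.$$

Further, we have

$$V_2^{-1} = \begin{pmatrix} V_{11}^{-1} + V_{11}^{-1}V_{12}V_{22.1}^{-1}V_{21}V_{11}^{-1} & -V_{11}^{-1}V_{12}V_{22.1}^{-1} \\ -V_{22.1}^{-1}V_{21}V_{11}^{-1} & V_{22.1}^{-1} \end{pmatrix},$$

where $V_{22.1} = V_1 - V_{21}V_{11}^{-1}V_{12}$. Note that $V_{21}V_{11}^{-1}V_{12}$ is non-negative definite and $V_1$ is positive definite. Thus, $V_{22.1} \le V_1$ and $V_{22.1}^{-1} \ge V_1^{-1} \ge V_1^{-1}Q_1V_1^{-1}$, as $Q_1 \le V_1$. Note that $V^{-1} = A(\theta_0)V_2^{-1}A(\theta_0)^T$, and then $\gamma_0^T V^{-1}\gamma_0 = (V_2^{-1})_{11} = V_{11}^{-1} + V_{11}^{-1}V_{12}V_{22.1}^{-1}V_{21}V_{11}^{-1}$. These result in $\mathrm{trace}(V_2^{-1}) \ge \mathrm{trace}(W)$. We then derive, noting that

$$A(\theta_0)^T A(\theta_0) = \begin{pmatrix} 1 & 0_{p-1}^T \\ 0_{p-1} & J(\theta_0)^T J(\theta_0) \end{pmatrix}$$

is a block-diagonal matrix, that

$$\begin{aligned}
\mathrm{trace}(V^{-1}) &= \mathrm{trace}(A(\theta_0)V_2^{-1}A(\theta_0)^T) \\
&= \mathrm{trace}(V_2^{-1}A(\theta_0)^T A(\theta_0)) \\
&= V_{11}^{-1} + V_{11}^{-1}V_{12}V_{22.1}^{-1}V_{21}V_{11}^{-1} + \mathrm{trace}(V_{22.1}^{-1}J(\theta_0)^T J(\theta_0)) \\
&\ge \gamma_0^T V^{-1}\gamma_0 + \mathrm{trace}(V_1^{-1}Q_1V_1^{-1}J(\theta_0)^T J(\theta_0)) \\
&= \mathrm{trace}(A(\theta_0)WA(\theta_0)^T).
\end{aligned}$$

For $W_1$ we have

$$\begin{aligned}
(\gamma_0^T V\gamma_0)^{-1} + \frac{\gamma_0^T VJ(\theta_0)V_1^{-1}Q_1V_1^{-1}J(\theta_0)^T V\gamma_0}{(\gamma_0^T V\gamma_0)^2}
&= V_{11}^{-1} + V_{11}^{-1}V_{12}V_1^{-1}Q_1V_1^{-1}V_{21}V_{11}^{-1} \\
&\le V_{11}^{-1} + V_{11}^{-1}V_{12}V_1^{-1}V_{21}V_{11}^{-1} \\
&\le V_{11}^{-1} + V_{11}^{-1}V_{12}V_{22.1}^{-1}V_{21}V_{11}^{-1}.
\end{aligned}$$

These result in $\mathrm{trace}(V_2^{-1}) \ge \mathrm{trace}(W_1)$ and then

$$\mathrm{trace}(V^{-1}) \ge \mathrm{trace}(A(\theta_0)W_1A(\theta_0)^T).$$

The proof is complete. □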
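The partitioned-inverse step in this proof rests on two standard facts: the lower-right block of $V_2^{-1}$ equals the inverse of the Schur complement $V_{22.1}$, and $V_{22.1}^{-1}$ dominates the inverse of the lower-right block of $V_2$. A quick numerical check on a randomly generated positive definite matrix (a stand-in for $V_2$, not a quantity from the paper) can be sketched as follows; `schur_blocks` is a hypothetical helper name.

```python
import numpy as np

def schur_blocks(v2):
    """Split a positive definite matrix with a 1 x 1 leading block, as in the proof."""
    v11, v12 = v2[:1, :1], v2[:1, 1:]
    v21, v22 = v2[1:, :1], v2[1:, 1:]
    v22_1 = v22 - v21 @ np.linalg.inv(v11) @ v12   # Schur complement V_{22.1}
    return v11, v12, v21, v22, v22_1

rng = np.random.default_rng(0)
p = 4
m = rng.normal(size=(p, p))
v2 = m @ m.T + p * np.eye(p)                       # random positive definite stand-in

v11, v12, v21, v22, v22_1 = schur_blocks(v2)
v2_inv = np.linalg.inv(v2)

# Loewner-order gap V_{22.1}^{-1} - V_{22}^{-1}; its eigenvalues should be >= 0.
gap = np.linalg.inv(v22_1) - np.linalg.inv(v22)
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2.0).min()

# (1,1) element of the inverse versus the inverse of the (1,1) element.
top_left_excess = v2_inv[0, 0] - 1.0 / v2[0, 0]
```

Both `min_eig` and `top_left_excess` are non-negative up to rounding, which is the matrix ordering the trace comparison exploits.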